Purpose: a complete reference so you can defend each task, explain it to non-technical people, and answer trade-off questions in meetings. Structure:
- Part 1: Understanding the system and the full toolset (plain-language explanation + examples + Azure equivalent)
- Part 2: Details of each task/subtask in the WBS
SingPost is Singapore's postal service. It is building a Logistics Platform on GCP (Google Cloud) consisting of: a BFF (Backend-for-Frontend) for merchant/mobile, business services (Account, Pricing, Routing, AWB/Label), an AI Service Layer (LLM Gateway, RAG, Guardrails), 8 Cloud SQL Postgres databases, Apigee as the API Gateway, Temporal/RabbitMQ for long-running workflows, and Backstage as the developer portal.
They hired our team to build the CI/CD pipeline for ~80 deployable units (50 microservices + 30 monoliths) across 5 environments: Dev → SIT → UAT → Pre-Prod → Prod.
Plain-language analogy: picture SingPost as a giant mail-sorting factory. Our CI/CD is the quality-control and automation line: developers write code → it is scanned for defects → packaged → moved to a test environment → and only once everything checks out does a manager approve the push to production.
This is a very important point, because it differs from how Azure DevOps setups often lump everything together:
| Repo type | Contains | Who edits | Branching |
|---|---|---|---|
| App Source (per service) | Node.js code, Dockerfile, unit tests | Domain developers | Trunk-based: main + feature/* |
| App Config (GitOps): `singpost-gitops-config` | Helm/Kustomize K8s manifests | Domain team + Platform | Single main + directory-per-env |
| IaC: `singpost-infra` | Terraform/Terragrunt code | Platform team only | Trunk-based + gated apply |
| CICD Components: `singpost-cicd-components` | Reusable GitHub Actions workflows | DevOps team (us) | Trunk-based |
Why split app, config, and infra into three repos (instead of merging them as many companies do):
- App code changes daily, with a small blast radius (one service)
- K8s config changes with each promotion (dev → sit → uat...), so the audit trail matters
- Infra changes weekly or monthly, with a huge blast radius (one VPC down = everything down)
- → Different RBAC: developers cannot edit infra; ArgoCD watches only the one config repo
Compare with Azure: similar to splitting Azure Repos into `app-repo` (developers push), `helm-config-repo` (watched by ArgoCD/FluxCD), and `infra-repo` (Bicep/Terraform). Azure DevOps supports the same split, but teams often merge the repos for convenience. Microsoft's best practice (Azure Landing Zone) also recommends splitting this way.
Trade-off:
- ✅ Clear separation, RBAC is easy to enforce, ArgoCD stays simple
- ❌ Cross-repo automation is complex (the image tag from the app repo has to be written into the config repo by a bot/PR)
- ❌ Developers have to understand all 3 repos (steep learning curve)
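The cross-repo step in the trade-off list (getting a new image tag into the config repo) can be sketched as below. This is a minimal sketch only: it assumes the config repo keeps a Helm values file with a single `tag:` key, and the function name `bump_image_tag` and the sample file content are hypothetical. A real bot would also open a PR with the result.

```python
import re

# Matches the first "tag: ..." line, capturing the key and its indentation.
TAG_LINE = re.compile(r"^([ \t]*tag:[ \t]*).*$", re.MULTILINE)


def bump_image_tag(values_yaml: str, new_tag: str) -> str:
    """Return the values file text with the first `tag:` value replaced."""
    updated, count = TAG_LINE.subn(
        lambda m: f'{m.group(1)}"{new_tag}"', values_yaml, count=1
    )
    if count == 0:
        raise ValueError("no 'tag:' key found in values file")
    return updated


# Hypothetical values file from the config repo.
example_values = 'image:\n  repository: app\n  tag: "abc123"\n'
print(bump_image_tag(example_values, "def456"))
```

Editing the text with a regular expression, rather than parsing and re-dumping YAML, keeps comments and key order intact, which makes the resulting PR diff a single line.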
```text
main ────●────●────●────●────●────●──────► (always production-ready)
          \  /      \  /      \  /
           \/        \/        \/
        feature/  feature/  hotfix/
       (1-3 days) (1-3 days) (hours)
```
Main rules (from ADR-005 + SingPost doc):
| Branch | Purpose | Lifetime | Protection |
|---|---|---|---|
| `main` | Trunk, always deployable | Permanent | PR + 1 approval + CI pass + CODEOWNERS |
| `feature/*` (or `ft-*`) | New feature | 1-3 days max | Must be rebased on main before merge |
| `hotfix/*` (or `bf-*`) | Emergency fix | Hours | 1 approval, bypasses SIT/UAT |
| `release/*` | Stabilization (optional) | Days | Cut from main, bug fixes only |
| `dev` (SingPost-specific) | Integration for DEV/SIT env | Permanent | Direct push allowed, but `git push -f` is forbidden |
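The naming rules in the table lend themselves to an automated check, for example in a pre-push hook or a CI step. The sketch below is illustrative: the patterns mirror the table, while the kind labels and the function name `classify_branch` are assumptions.

```python
import re
from typing import Optional

# One (kind, pattern) pair per row of the branch table.
BRANCH_RULES = [
    ("trunk", re.compile(r"^main$")),
    ("integration", re.compile(r"^dev$")),
    ("feature", re.compile(r"^(feature/|ft-).+")),
    ("hotfix", re.compile(r"^(hotfix/|bf-).+")),
    ("release", re.compile(r"^release/.+")),
]


def classify_branch(name: str) -> Optional[str]:
    """Return the branch kind, or None if the name matches no rule."""
    for kind, pattern in BRANCH_RULES:
        if pattern.match(name):
            return kind
    return None


print(classify_branch("feature/add-label"))  # feature
print(classify_branch("my-experiment"))      # None
```

A CI job could fail the PR when `classify_branch` returns `None`, so off-convention branch names are caught before review.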
Why Trunk-Based and not Gitflow?
- Gitflow has a long-lived `develop` branch → constant merge conflicts, slow integration
- With 80 services, Gitflow is a cherry-pick nightmare
- Feature flags replace feature branches for handling incomplete work
- ArgoCD canary/blue-green replaces release branches
Common interview question: "Why not Gitflow?" → Answer: "Gitflow suits versioned products (Photoshop, a few releases per year). We are building a SaaS that deploys daily, so trunk-based fits better: fast merges, early integration, and easy rollback of individual commits."
This is how the team rolls the pipeline out in phases instead of building the full pipeline at once (lower risk):
Phase M goal: a minimal pipeline that is still enough to deploy Cloud Run to Dev/UAT/Prod.
| Capability | Tool |
|---|---|
| Source + orchestrator | GitHub + GitHub Actions |
| SAST + Secret scan | CodeQL + GitHub Secret Scanning |
| Lint + unit test | ESLint + Vitest (cov ≥ 60%) |
| Build | Cloud Build |
| Registry | Google Artifact Registry (GAR) |
| DB migration | Flyway |
| Deploy | Cloud Deploy → Cloud Run |
| Smoke test | curl |
| Notify | Google Chat |
Manual in this phase: GKE deploy, Apigee deploy, rollback (hand-run gcloud commands).
Phase 1 adds: Dependency Review, Checkov, Container Analysis CVE gate, Apigee CI (apigeelint + apickli + Maven), integration tests (Testcontainers), catalog-info.yaml lint, GitHub Job Summary.
Phase 2 adds: Cloud Deploy → GKE (Helm), canary 5% → 25% → 100%, Playwright E2E, k6 perf, Pact contracts, PII/DLT/PDPA tests, DAST (Web Security Scanner), anti-pattern lint (Backstage v9), Buf CLI schema checks, PagerDuty pipeline alerts.
Phase 3 adds: AI Service Layer CI/CD, Binary Authorization (image signing), Flagger GKE canary, Locust soak test, Schemathesis fuzzing, SLO burn-rate gate, Temporal compat check, AI Guardrails PII test, Backstage API sync, Claude AI PR review.
Why a phased rollout (instead of everything at once)?
- Phase M lets the client see value as early as week 2 (deploys work)
- Each phase adds one layer of safety without over-engineering up front
- The team gets time to learn GCP/Apigee/Backstage
- The client gets time to provision SA/IAM/budget
The pipeline runs in the order below; each stage runs several tools in parallel:
```text
[Pre-commit] → [PR & Review] → [Security Scan] → [Build & Registry]
                                                         ↓
                                  [Deployment] ← [DB Migration]
                                       ↓
                                   [Testing] → [Observability]
```
Analogy: like spell-checking and formatting as you type, before the email is sent.
| Tool | What it is | Strength | Weakness | Azure equivalent |
|---|---|---|---|---|
| Husky + lint-staged | Git hook that runs lint/format on every commit | Fast feedback, no CI needed | Developers can skip it with `--no-verify` | Husky also works with Azure DevOps |
| Commitlint | Enforces a commit message format (e.g. `feat:`, `fix:`) | Easy changelog generation | Annoying at first | No native equivalent; use Commitlint |
| GitHub Secret Scanning push protection | Blocks the push if a secret is detected (AWS key, JWT...) | Blocks at the server | Requires GitHub Advanced Security ($$) | Azure DevOps has Credential Scanner |
| Dependabot | Bot that opens a PR when a dependency has a CVE or is outdated | Automatic, free, native | Many PRs, occasional false alerts | Renovate (alternative), or Azure Defender for DevOps |
Replacement tool note (from the file):
- Secret scan: replace `GitHub Secret Scanning` with a `Gitleaks pre-push hook` (open source and free, but you have to set up the hook yourself)
- Dependabot → `Renovate Bot` (open source, maintained by Mend; more configuration options and support for more registries)
Real-world example: a developer named Bình accidentally pastes an AWS access key into `.env` and commits it. The Husky + Gitleaks pre-push hook detects it and rejects the push with the message: `Secret detected in .env line 5: AKIA****`. Bình has to remove the key and rotate it in the AWS console before the push goes through.
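To make the example concrete, here is a rough sketch of what a pattern-based secret rule does. It is not Gitleaks itself: the single regex covers only AWS access key IDs that start with `AKIA`, the function name `find_secrets` is hypothetical, and the sample key is the placeholder from AWS documentation.

```python
import re
from typing import List

# AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_secrets(path: str, text: str) -> List[str]:
    """Return one masked finding per line that contains an AWS key ID."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = AWS_KEY_ID.search(line)
        if match:
            masked = match.group(0)[:4] + "****"
            findings.append(f"Secret detected in {path} line {lineno}: {masked}")
    return findings


# Placeholder key from AWS documentation, not a real credential.
env_file = "PORT=3000\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\n"
for finding in find_secrets(".env", env_file):
    print(finding)
```

A pre-push hook would exit non-zero when the findings list is non-empty, which is what rejects the push.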
Analogy: like handing in homework, where a teacher plus an automatic grader check it before it goes into the grade book.
| Tool | What it is | Strength | Weakness |
|---|---|---|---|
| GitHub + GitHub Actions | Source control + CI orchestrator | All-in-one, generous free tier, rich marketplace of actions | Limited compute (free tier), vendor lock-in |
| GitHub Environments | Configures required reviewers and deployment rules per env | Manual approval gate is very easy to set up | Requires GitHub Team plan or higher |
| Claude (Anthropic API) AI code review | AI reviews PRs, detects bugs, suggests fixes | Catches logic bugs that SAST misses | Costs $ per token, can miss context |
| GitHub Actions labeler | Auto-adds labels based on file path | Categorizes PRs automatically | Rules need careful configuration |
Real-world example: a developer opens a PR that changes `apps/billing/src/payment.ts`. The labeler attaches the label `domain:billing`. The PR pipeline runs Gitleaks → CodeQL → Vitest. A CODEOWNERS rule requires a review from @billing-team. The Claude AI bot comments: "Line 42 does not validate `amount > 0`, which may allow a negative payment". The reviewer sees it and comments, the developer fixes it, and the PR is merged.
Phase M scope: GitHub + GitHub Actions + GitHub Environments only. AI review is Phase 3.
Analogy: like an airport security scanner. If you do not pass, you do not get in.
| Tool | Characteristics |
|---|---|
| GitHub Dependency Review Action + Dependabot Alerts | Native, gates on the PR. Requires the GitHub Code Security add-on (paid). Blocks CVEs above the severity threshold. |
| Snyk (replacement) | Commercial, more powerful, supports more languages, covers SCA + container + IaC |
Example: a PR bumps `lodash` from 4.17.20 to 4.17.21. The action runs and detects: "lodash 4.17.20 has CVE-2021-23337 (command injection)". Because the PR upgrades to the fixed version, it passes. But if a PR adds `lodash@4.0.0` (still vulnerable), the PR fails.
| Tool | Characteristics |
|---|---|
| CodeQL | GitHub native, free for public repos, paid for private ones. Queries are written in QL, a Datalog-derived language, and run against a database extracted from the code. Strong on injection, XSS, hardcoded credentials. |
| CodeQL + Semgrep (replacement) | Semgrep is faster, its rules are easier to write (YAML), and it is free OSS. |
| SonarQube (mentioned in repo) | Code quality + security. Has a Quality Gate. Needs `SONAR_TOKEN` + `SONAR_HOST_URL`. |
Example: CodeQL finds `eval(req.query.cmd)` in the code → severity CRITICAL → PR blocked. SonarQube reports: "Cyclomatic complexity 25, above the threshold of 15. Refactoring required."
Current blocker for the team: SonarQube is INVALID because `SONAR_TOKEN`/`SONAR_HOST_URL` have not been provided. If the client does not supply them, the team has to self-host SonarQube, which needs a server plus a Developer license to scan PRs.
| Tool | Characteristics |
|---|---|
| GitHub Secret Scanning | Native, scans the full history. Raises alerts in the Security tab. Does not block (only push protection blocks). |
| Gitleaks (in use at SingPost) | OSS, custom rules via TOML config. Can fail the pipeline on detection. |
| Tool | Characteristics |
|---|---|
| FOSSA | Scans dependency licenses and alerts if GPL code ends up in a proprietary product. |
Example: the team uses `prettier` (MIT, fine) but accidentally pulls in `something-gpl` (GPL-3.0, a copyleft license that "infects" the product). FOSSA alerts → merge blocked.
| Tool | Characteristics |
|---|---|
| Checkov | Scans Terraform/CloudFormation/K8s/Helm. 1000+ rules. Free OSS. Runs only on the Infra repo, not on App repos. |
Example: a developer writes Terraform that creates a `google_storage_bucket` without setting `uniform_bucket_level_access = true`. Checkov rule `CKV_GCP_29` → fail.
| Tool | Characteristics |
|---|---|
| Google Cloud Web Security Scanner | GCP native, scans the deployed app (by URL). Detects XSS, SQL injection. Runs after the deploy to UAT. |
| OWASP ZAP (replacement) | Classic OSS tool, highly customizable |
Example: the app deployed to UAT has an endpoint `/search?q=...`. The scanner tries `?q=<script>alert(1)</script>` → the response echoes it back → XSS DETECTED.
| Tool | Characteristics |
|---|---|
| Binary Authorization | GCP native. Only allows deployment of images signed by an attestor (after Container Analysis passes). Closes the supply-chain gap. |
Example: an attacker breaks into GAR and uploads `malicious-image:v1`. When Cloud Run is asked to deploy this image, Binary Authorization checks: "this image has no attestation from Cloud Build" → deploy BLOCKED.
| Tool | Characteristics |
|---|---|
| Pub/Sub Schema Registry | GCP native, enforces the schema at publish time. |
| Buf CLI | Detects breaking changes in Protobuf schemas during CI. |
Example: Service A publishes an `OrderCreated` event with a field `total_amount`. A developer renames it to `amount` → Buf CLI detects the breaking change → PR fails. If it slips through, the Pub/Sub Schema Registry rejects the publish at runtime.
| Tool | Characteristics |
|---|---|
| Schemathesis | OSS, generates random requests from the OpenAPI spec to fuzz the API. Detects crashes and edge cases. |
Analogy: like packing goods into cardboard boxes, labeling them, and moving them into the warehouse.
| Tool | Characteristics |
|---|---|
| Google Cloud Build | GCP native, builds inside the GCP network (fast when pushing to GAR). Auth via OIDC from GitHub Actions. |
| Docker build on a GitHub Actions runner (replacement) | Builds directly on the `ubuntu-latest` runner. Use when not on GCP, or when a quick build without Cloud Build is needed. |
Choice: Phase M uses Cloud Build (because the client is on GCP). If a later client is on Azure/AWS, switch to a native GitHub Actions build.
Example: a PR is merged into `main` → GitHub Actions triggers → authenticates to GCP via WIF → submits a Cloud Build job → Cloud Build pulls the repo, runs `docker build` with a multi-stage Dockerfile, scans the image, and pushes it to GAR with the tag `git-sha-abc123` + semver `v1.2.3`.
| Tool | Characteristics |
|---|---|
| Google Artifact Registry + Container Analysis | Auto-scans images pushed to GAR. Generates a vulnerability report. Blocks the deploy if a CVE exceeds the threshold (Phase 1+). |
| Trivy (replacement) | OSS, scans locally, fast, supports many registries. |
Example: image `app:v1.0` is pushed to GAR. Container Analysis scans it and detects that the base image `node:18.0` has CVE-2023-XXX (CRITICAL). The Phase 1+ pipeline fails. The developer has to upgrade the base image to `node:18.19`.
| Tool | Characteristics |
|---|---|
| Helm | OSS, packages K8s manifests into charts with versioning and templating. |
| GAR OCI mode | Pushes Helm charts to GAR as OCI artifacts (no Chart Museum needed). |
Example: the repo has `charts/myapp/values.yaml`. CI runs `helm lint` → pass → `helm package` → push to `oci://us-docker.pkg.dev/PROJECT/helm-repo/myapp:1.2.3`.
| Tool | Characteristics |
|---|---|
| apigeelint | Checks Apigee proxy XML for syntax and policy errors. |
| apickli | Functional tests for Apigee proxies (Cucumber-style). |
| Apigee Maven plugin | Deploys the proxy to an Apigee env (dev/uat/prod). |
Example: a PR changes `proxies/billing/policies/spike-arrest.xml` → apigeelint scans it → apickli runs the test "GET /billing returns 200 within 500ms" → the Maven plugin deploys to the Apigee `dev` org.
| Tool | Characteristics |
|---|---|
| Flyway | OSS, versioned SQL migrations. Files are named `V1__create_users.sql`, `V2__add_email_column.sql`. Each env has a schema history table that tracks what has been applied. |
Example: a PR adds `db/migration/V12__add_phone_to_users.sql`. CI runs a pre-check (`flyway validate` and `flyway info`; a SQL preview via `-dryRunOutput` is available only in the paid Flyway editions) → review. On deploy to DEV: `flyway migrate` applies it to the DEV DB. The same happens for UAT and PROD.
Important: Flyway must run BEFORE the app is deployed (app v2 needs schema v2). In Phase M, Flyway is non-negotiable from day 1 because a schema mismatch means a production outage.
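The versioned-migration idea can be illustrated with a small sketch that works out which files are still pending for an environment. It is a simplification, not Flyway's implementation: it assumes integer versions only (Flyway also accepts dotted versions such as `V1.1__`), and `pending_migrations` is a hypothetical helper name.

```python
import re
from typing import List, Set

# V<version>__<description>.sql, integer versions only in this sketch.
MIGRATION_NAME = re.compile(r"^V(\d+)__(\w+)\.sql$")


def pending_migrations(files: List[str], applied_versions: Set[int]) -> List[str]:
    """Return not-yet-applied migration files in ascending version order."""
    versioned = []
    for name in files:
        match = MIGRATION_NAME.match(name)
        if match:  # non-migration files are ignored
            versioned.append((int(match.group(1)), name))
    return [name for version, name in sorted(versioned)
            if version not in applied_versions]


files = [
    "V12__add_phone_to_users.sql",
    "V2__add_email_column.sql",
    "V1__create_users.sql",
    "README.md",
]
print(pending_migrations(files, applied_versions={1, 2}))
```

Versions are compared as integers, so `V12` sorts after `V2`; a plain string sort would apply them in the wrong order.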
Analogy: like delivery. There is express delivery (Cloud Run), assembly-line delivery (GKE), and a trial delivery to a few people first (canary).
| Tool | Characteristics |
|---|---|
| Google Cloud Deploy | Managed CD service. Create a "Release" → promote it through targets (dev → uat → prod). Supports promotion approval and rollback. |
Compare Azure: similar to Azure Container Apps + Azure Pipelines stage approvals.
Example: once the image is built, create a release with `gcloud deploy releases create release-abc --delivery-pipeline=app-pipeline --images=app=gcr.io/.../app:abc`. The release rolls out to the DEV target automatically. After DEV passes, someone manually clicks "Promote to UAT" in the Cloud Console.
| Tool | Characteristics |
|---|---|
| Google Cloud Deploy for GKE (Helm renderer) | Cloud Deploy has a GKE target type. It renders the Helm chart and applies it to the cluster. |
| `helm upgrade` directly in GitHub Actions (replacement) | Simpler, but loses governance: no promote/approval/rollback as in Cloud Deploy. |
| Tool | Characteristics |
|---|---|
| Cloud Deploy native canary (for Cloud Run) | Built-in 5% → 25% → 100% with watch periods. |
| Flagger (for GKE, Phase 3) | OSS progressive delivery controller; shifts traffic based on Prometheus/Cloud Monitoring metrics. |
| Cloud Monitoring | Metrics source: error rate, p99 latency. |
Example: PROD deploys v2.0 → Cloud Deploy routes 5% of traffic to v2.0. Cloud Monitoring tracks the error rate for 10 minutes. If the error rate is ≤ 1% → bump to 25%. Another 10 minutes → 100%. If it spikes above 1% → automatic rollback.
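The canary example above boils down to a small decision function. The sketch below is only a model of that logic, using the 5/25/100 steps and the 1% threshold from the example; the function name and return strings are assumptions, and the real gating is done by Cloud Deploy and Cloud Monitoring.

```python
CANARY_STEPS = [5, 25, 100]  # traffic percentages from the example
MAX_ERROR_RATE = 0.01        # 1% threshold from the example


def next_canary_action(current_percent: int, error_rate: float) -> str:
    """Decide what happens after a watch period at the current traffic share."""
    if error_rate > MAX_ERROR_RATE:
        return "rollback"
    if current_percent >= CANARY_STEPS[-1]:
        return "done"
    next_step = next(step for step in CANARY_STEPS if step > current_percent)
    return f"promote to {next_step}%"


print(next_canary_action(5, 0.002))   # promote to 25%
print(next_canary_action(5, 0.025))   # rollback
```

The error-rate check comes first so that a bad release is rolled back even when it is already serving 100% of traffic.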
| Tool | Characteristics |
|---|---|
| Terraform | OSS, declarative IaC, state file. |
| Terragrunt | Terraform wrapper, DRY config across envs. Each env (dev/sit/uat/preprod/prod) has its own folder with a `terragrunt.hcl`. |
Compare Azure: similar to Bicep + Azure Resource Manager, or Terraform with `-var-file=dev.tfvars`.
Example structure:
```text
infra/
  modules/
    gke/
    networking/
  live/
    dev/gcp/gke/terragrunt.hcl
    uat/gcp/gke/terragrunt.hcl
    prod/gcp/gke/terragrunt.hcl
```
A developer edits `live/dev/gcp/gke/terragrunt.hcl` → CI scoping detects that only the `dev` env changed → runs `terragrunt plan` + `apply` for dev only.
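The scoping step can be sketched as a function that maps changed file paths to environment names. This is an assumption-laden sketch: it relies on the `live/<env>/...` layout shown above, the name `changed_environments` is hypothetical, and it deliberately ignores changes under `modules/`, which in practice may affect every environment.

```python
from typing import Iterable, List


def changed_environments(changed_files: Iterable[str]) -> List[str]:
    """Return the sorted env names touched under any `live/<env>/` directory."""
    envs = set()
    for path in changed_files:
        parts = path.split("/")
        if "live" in parts:
            idx = parts.index("live")
            # The segment after `live` must be a directory, not the file itself.
            if idx + 1 < len(parts) - 1:
                envs.add(parts[idx + 1])
    return sorted(envs)


changed = ["live/dev/gcp/gke/terragrunt.hcl", "infra/modules/gke/main.tf"]
print(changed_environments(changed))
```

In a workflow, the file list would typically come from `git diff --name-only` against the base branch, and the result would feed a job matrix so each environment gets its own plan.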
Analogy: like testing a car before it leaves the factory: test each part (unit), test the whole car (integration), test on the road (E2E), stress test (load test).
| Tool | Phase | What it is | When to use |
|---|---|---|---|
| Vitest | M | Unit tests for JS/TS (like Jest but faster) | Testing one function or one module |
| Vitest + Supertest + Testcontainers | 1 | Integration tests against a real DB/Redis in Docker containers | Testing real API + DB interaction |
| Playwright | 2 | E2E browser tests (Chrome/Firefox/Safari) | Testing a user journey: click button → submit form → check DB |
| Pact | 2 | Consumer-driven contract tests | Service A consumes Service B → ensures B does not break A on update |
| k6 | 2 | Load tests with perf thresholds (p95, p99, error rate) | UAT load test before prod |
| Locust on Cloud Run Jobs | 3 | Soak tests (run for hours to days to expose memory leaks) | Long-term stability in pre-prod |
| Schemathesis | 3 | API fuzz tests from the OpenAPI spec | Finding edge cases nobody thought of |
| Temporal CLI compat check | 3 | Checks workflow type compatibility before deploy | Avoiding broken in-flight workflows |
| DLT chaos test (custom GH Action + gcloud) | 2 | Tests the Dead Letter Topic while a service is down | Verifying no events are lost |
| PII compliance test (bq CLI + gcloud) | 2 | Tests PII tokenization and PDPA erasure | Singapore compliance |
| curl (smoke test) | M | Health check + critical path after deploy | Verifying the deploy succeeded |
Pact example: service `billing` calls the `/accounts/v1/balance` endpoint of service `accounts`. Pact contract:
```text
GIVEN account 123 exists
WHEN GET /accounts/v1/balance?id=123
THEN return 200 + { "balance": number }
```
When the `accounts` team deploys → CI runs `pact verify` → checks that the `balance` field is still a `number` (not changed to `string`). If it breaks → fail before deploy.
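What the verification asserts can be sketched without the Pact tooling. The function below is not the Pact API; it only models the check for this one contract, and the name `satisfies_balance_contract` is hypothetical.

```python
from numbers import Number


def satisfies_balance_contract(status: int, body: dict) -> bool:
    """True if the provider response matches: 200 + { "balance": number }."""
    balance = body.get("balance")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(balance, Number) and not isinstance(balance, bool)
    return status == 200 and is_number


print(satisfies_balance_contract(200, {"balance": 12.5}))    # True
print(satisfies_balance_contract(200, {"balance": "12.5"}))  # False
```

The second call is the breaking change from the example: the field still exists but its type changed to a string, so the provider build fails before deploy.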
DLT chaos test example: a GitHub Action calls `gcloud run services delete event-log` (simulating downtime) → publishes 100 events to Pub/Sub → queries the DLT subscription → asserts that all 100 events landed in the DLT (none lost). After the test, `gcloud run deploy` restores the service.
PII test example, workflow:
- Publish an event containing `{ "email": "user@test.com" }` →
- Pull it from business-event-log →
- Assert that it has a `pii_ref_id` (tokenized) and does NOT contain the raw email →
- Call the erasure API →
- Query the Cloud SQL PII Vault → assert the row is deleted →
- Query BigQuery → assert the tombstone marker is present.
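The tokenization assertion in the third step can be sketched as follows. The field name `pii_ref_id` comes from the workflow above; the email regex, the sample events, and the function name `is_tokenized` are illustrative assumptions.

```python
import json
import re

# Loose email pattern, good enough to catch raw addresses in a payload.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_tokenized(event: dict) -> bool:
    """True if the event carries a pii_ref_id and no raw email anywhere."""
    serialized = json.dumps(event)  # also covers nested fields
    return "pii_ref_id" in event and EMAIL.search(serialized) is None


print(is_tokenized({"pii_ref_id": "tok_1", "type": "OrderCreated"}))  # True
print(is_tokenized({"email": "user@test.com"}))                       # False
```

Searching the serialized payload rather than a single key catches an email that leaks through a nested or unexpected field.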
Analogy: like security cameras plus a burglar alarm in a house.
| Tool | Phase | What it is |
|---|---|---|
| Cloud Monitoring + Cloud Deploy release inspector | M+ | Post-deploy health for Cloud Run, auto-rollback trigger |
| Google Cloud Monitoring SLO policies | 3 | Monitors SLO burn rate, alerts when the error budget burns too fast |
| GitHub Actions Job Summary | 1 | Aggregated test results shown in the Actions UI |
| Allure (replacement) | 1 | Nicer test reports, keeps history |
| Google Chat webhook | M | Notifies the team channel of deploy status and build pass/fail |
| PagerDuty | 2 | On-call escalation for P1 |
| Google Cloud Alerting (replacement) | 2 | Native GCP alerting, free |
Auto-rollback condition: error rate > 1% OR p99 latency > threshold OR SLO burn-rate spike.
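The rollback condition can be written out as a predicate. This sketch uses the standard definition of burn rate (observed error rate divided by the error budget, `1 - SLO target`); the 1% error-rate threshold is from the text, while the p99 limit, the SLO target, and the maximum burn rate are placeholder values that the team would have to choose.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the error budget (1 - SLO target)."""
    return error_rate / (1.0 - slo_target)


def should_rollback(error_rate: float, p99_ms: float, *,
                    p99_limit_ms: float, slo_target: float,
                    max_burn_rate: float) -> bool:
    """Error rate > 1% OR p99 above its limit OR burn rate above its limit."""
    return (
        error_rate > 0.01
        or p99_ms > p99_limit_ms
        or burn_rate(error_rate, slo_target) > max_burn_rate
    )


# Placeholder thresholds, not values from the SingPost documents.
limits = dict(p99_limit_ms=500, slo_target=0.999, max_burn_rate=14.4)
print(should_rollback(0.025, 200, **limits))  # True: error rate above 1%
print(should_rollback(0.001, 200, **limits))  # False: all signals healthy
```

With a 99.9% SLO, an error rate of 0.5% already burns the budget about five times faster than sustainable, which is why burn rate can trip before the raw 1% threshold does.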
Example: v2.0 is deployed to PROD as a 5% canary. After 5 minutes the error rate spikes from 0.1% to 2.5%. The Cloud Deploy release inspector detects it → triggers a rollback → Cloud Deploy shifts traffic back to v1.9 → PagerDuty raises a P1 alert → Google Chat posts "ROLLBACK v2.0 due to error spike".
The old problem: GitHub Actions needs to authenticate to GCP. The old solution was to create a Service Account, download a JSON key, and store it in GitHub Secrets. Risk: a leaked key compromises the whole GCP project, and the key never expires.
WIF solution: keyless. GitHub Actions obtains an OIDC token → GCP trusts GitHub's OIDC issuer → the token is exchanged → the workflow impersonates the Service Account → with short-lived credentials (15 minutes in this setup; Google's default lifetime for a service account access token is 1 hour).
WIF setup requires:
- Creating a Workload Identity Pool (one)
- Creating a Provider that trusts `github.com/<org>/<repo>` (one)
- A Service Account with a `roles/iam.workloadIdentityUser` binding for the repo
- GitHub Actions using `google-github-actions/auth@v2` with `workload_identity_provider` + `service_account`
Compare Azure: similar to Azure Federated Identity Credentials (Managed Identity + OIDC trust with GitHub). The setup is similar too: a Federated Credential on the Managed Identity that trusts the GitHub subject.
This is the first task in many of the WBS tasks: "Configure OIDC trust between GitHub Actions and GCP (WIF)".
| | Cloud Run | GKE Autopilot |
|---|---|---|
| What it is | Serverless containers, scales 0 to N | Managed K8s, Google manages the nodes |
| Good for | Stateless APIs, web services, batch | Stateful, complex workloads (Temporal, RMQ, ArgoCD, Backstage) |
| Deploy | `gcloud run deploy` or Cloud Deploy → Cloud Run | `kubectl apply` or Cloud Deploy → GKE (Helm) |
| Pricing | Per request + per CPU/RAM-second | Per pod resource request |
| Cold start | Yes (~1-5s) | No (pods are always running) |
SingPost uses: Cloud Run for BFF/DSB/BIZ services. GKE for Temporal, the RabbitMQ bridge, and Backstage.
Compare Azure: Cloud Run ≈ Azure Container Apps; GKE Autopilot ≈ AKS Automatic (new)/AKS.
Cloud Deploy: managed CD service. Where CI builds, CD deploys and promotes across environments.
Concepts:
- Delivery Pipeline: defines the deploy flow (DEV → UAT → PROD)
- Target: one specific env (e.g. Cloud Run in project `singpost-dev`)
- Release: one deployable instance with a specific image
- Rollout: the deployment of a release to one target
Compare Azure: similar to an Azure DevOps multi-stage YAML pipeline with manual approval per stage.
Apigee: GCP's API gateway. SingPost uses 3 product sets:
- External: Cloud Armor/WAF, OAuth/JWT, spike arrest, response cache, analytics → BigQuery
- Internal (IH/I contracts): internal rate limiting, X-Correlation-ID, OpenAPI sync to Backstage
- AI: AI Service Layer endpoints
CI/CD for Apigee = lint the proxy XML + run functional tests + deploy the proxy bundle via the Maven plugin.
Container Analysis: automatic vulnerability scanning for images in GAR. Its output serves as the gate for Binary Authorization (Phase 3).
Pub/Sub: message broker + schema enforcement. Not the same as Kafka (whose closest Azure counterpart is the Kafka-compatible endpoint of Azure Event Hubs).
| Repo type | ITSM touchpoint | Mechanism |
|---|---|---|
| App Source | None | CI only, no deploy |
| App Config Pre-Prod | CR auto-created when the PR merges | GitHub Action calls the ITSM API |
| App Config Prod | CR must be Approved before ArgoCD sync | ArgoCD PreSync hook validates the CR status |
| IaC Pre-Prod/Prod | CR auto-created when a PR touches `live/preprod/` or `live/prod/` | GH Action blocks apply until the CR is Approved |
Compare Azure: similar to the ServiceNow integration in Azure DevOps Pipelines (CR gate).
Strengths:
- GitHub Actions ecosystem: native, very rich marketplace, good secrets management, OIDC-ready for WIF
- GCP native tools (Cloud Build, Cloud Deploy, GAR, Container Analysis): deep integration, tight IAM, no extra infra to maintain
- OSS tools (Gitleaks, Checkov, Terraform, Helm, Flyway, Vitest, Playwright, k6): free, large communities, no vendor lock-in
- Apigee: enterprise-grade API governance, good analytics
- Backstage: developer portal, software catalog, golden path templates
Weaknesses:
- GitHub Actions: limited compute (free tier 2000 min/month), limited runner OS choice, vendor lock-in (workflow syntax does not port to GitLab/Jenkins)
- GCP-native: hard to port to AWS/Azure if the platform goes multi-cloud later. Per-API-call pricing can get expensive at scale
- CodeQL: slow on large repos, query language (QL, Datalog-based) is hard to learn
- SonarQube: needs infra to self-host (or pay for SonarCloud); PR scanning requires the Developer edition or higher
- Apigee: expensive, steep learning curve, slow proxy deploys
- Terraform state: needs a remote backend; conflicts when several people apply
- Helm: template syntax is hard to debug, not type-safe
- Cloud Deploy: relatively new, with fewer advanced canary rules than Argo Rollouts
The WBS sheet is structured as: Priority → Flow → Tools → Assignee → Main Task → Sub Task → Estimation → Status → Notes
Total estimation: 35 man-hours (~4.4 man-days). Overall goal: set up the PR/CI/CD orchestration foundation for 6 repos.
What it is: each repo has one default branch, shown when a user opens the repo on GitHub. Under trunk-based development, default = `main`.
Solution:
- Go to repo Settings → Branches → Default branch → select `main`
- If the repo is created from a template, set it through the org template
- The GitHub API or the `gh` CLI can batch this: `gh repo edit OWNER/REPO --default-branch main`
Example: all 6 SingPost repos (4 app + 1 infra + 1 cicd-components) need `main` as the default instead of `master` (the old default).
Strength: simple, one-time setup
Weakness: if set wrong, the branch has to be migrated (renaming master → main affects open PRs)
What it is: GitHub Rulesets (newer than Branch Protection Rules) define who may merge, how many approvals are needed, and which status checks must pass.
Proposed config:
Ruleset for main:
- Require PR (no direct push)
- Require 1 approval
- Require status checks:
  - gitleaks-scan
  - codeql-scan
  - unit-test
  - container-analysis (Phase 1+)
- Require CODEOWNERS review
- Dismiss stale reviews on push
- Require linear history (no merge commit)
- No force push, no deletion
Example scenario:
- A developer opens a PR changing `src/api.ts`. The PR runs 4 status checks. 3 pass, 1 fails. The merge button is disabled.
- The developer fixes it and pushes again. Reviewer A had already approved → the approval is dismissed → re-approval is needed.
Strength: Rulesets are more powerful than Branch Protection, can target multiple branch patterns, and support an override hierarchy
Weakness: requires the GitHub Team plan; some rules require Enterprise
Compare Azure: similar to Azure Repos Branch Policies (Required Reviewers, Build Validation, Status Checks).
What it is: create a template file `template.yml` in each repo that calls the reusable workflows from `singpost-cicd-components`.
Solution structure:
```yaml
# .github/workflows/template.yml in each repo
name: CI-CD Workflow
on:
  push: { branches: [dev], paths: ['src/**', 'package*.json'] }
  pull_request: { branches: [main, dev] }
  workflow_dispatch:
permissions:
  contents: read
  security-events: write
  id-token: write
jobs:
  gitleaks-scan:
    uses: SINGAPORE-POST-LIMITED/singpost-cicd-components/.github/workflows/reusable-workflow-gitleaks.yml@main
    with: { scan_mode: full }
    secrets: inherit
  codeql-scan:
    uses: SINGAPORE-POST-LIMITED/singpost-cicd-components/.github/workflows/reusable-workflow-codeql.yml@main
    secrets: inherit
  sonar-scan:
    uses: SINGAPORE-POST-LIMITED/singpost-cicd-components/.github/workflows/reusable-workflow-sonarqube.yml@main
    with: { PROJECTID: 'singpost-app-xxx' }
    secrets: inherit
  checkov-scan:
    if: contains(github.repository, 'infra')
    uses: SINGAPORE-POST-LIMITED/singpost-cicd-components/.github/workflows/reusable-workflow-checkov.yml@main
    secrets: inherit
```
Key concepts:
- `uses: org/repo/path@ref` = cross-repo reusable workflow. Change the file once in `singpost-cicd-components` and every calling repo picks up the update (DRY principle).
- `secrets: inherit` = passes the caller repo's secrets through to the called workflow (by default a reusable workflow cannot see them).
- `if: contains(github.repository, 'infra')` = run Checkov only for repos with `infra` in the name, skipping app repos (saves runner time).
Example: a developer pushes code to singpost-app-dsb-TrackingIngestAPI → the workflow runs gitleaks + codeql + sonar (Checkov is skipped because the repo name does not contain `infra`).
Strength:
- DRY: write once, use in 5 places
- Easy version control: pin `@main` or `@v1.2.3` for rollback
- Centralized maintenance
Weakness:
- Cross-repo workflows must be public or in the same org
- `@main` is not stable; use a tag such as `@v1` for prod
- Harder to debug (two files to read)
What it is: template workflows for the Build & Registry stage. One reusable workflow per tool.
Solution:
```yaml
# reusable-cloud-build.yml
name: Cloud Build
on:
  workflow_call:
    inputs:
      project_id: { required: true, type: string }
      image_name: { required: true, type: string }
jobs:
  build:
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      # Keyless auth to GCP through Workload Identity Federation
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.BUILD_SA_EMAIL }}
      - run: gcloud builds submit --config=cloudbuild.yaml --substitutions=_IMAGE=${{ inputs.image_name }}
```
```yaml
# reusable-flyway.yml
name: Flyway Migrate
on:
  workflow_call:
    inputs:
      env: { required: true, type: string }
    secrets:
      DB_PASSWORD: { required: true }
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Mount the migration folder into the Flyway container and apply it
      - run: |
          docker run --rm \
            -v "$PWD/db/migration:/flyway/sql" \
            flyway/flyway:10 \
            -url=jdbc:postgresql://${{ vars.DB_HOST }}/${{ inputs.env }} \
            -user=${{ vars.DB_USER }} \
            -password=${{ secrets.DB_PASSWORD }} \
            migrate
```
Strength of the mock workflow setup: catches syntax errors early, no real GCP access needed
Weakness: does not validate real business logic, it is only a skeleton
Apigee:
```yaml
jobs:
  apigee-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          npm install -g apigeelint apickli
          apigeelint -s apiproxy/ -f stylish.js
          mvn -f pom.xml install -P${{ inputs.env }} \
            -Dapigee.config.options=update \
            -Dpassword=${{ secrets.APIGEE_TOKEN }}
```
Cloud Deploy GKE:
```yaml
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: google-github-actions/auth@v2 # WIF
      - run: |
          gcloud deploy releases create release-${{ github.sha }} \
            --delivery-pipeline=app-gke-pipeline \
            --region=asia-southeast1 \
            --images=app=us-docker.pkg.dev/PROJECT/app/app:${{ github.sha }}
```
Terraform:
```yaml
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # setup-terraform installs Terraform only; Terragrunt needs its own install step
      - uses: hashicorp/setup-terraform@v3
      - run: terragrunt run-all plan
  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    environment: prod # manual approval GATE
    steps:
      - uses: actions/checkout@v4
      - run: terragrunt run-all apply -auto-approve
```
Solution:
```yaml
- name: Notify Google Chat
  if: always()
  env:
    GCHAT_WEBHOOK: ${{ secrets.GCHAT_WEBHOOK }}
  run: |
    # Google Chat incoming webhooks accept a JSON body with a "text" field.
    curl -sS -X POST -H 'Content-Type: application/json' \
      -d "{\"text\": \"Deploy ${{ job.status }} for ${{ github.repository }}\"}" \
      "$GCHAT_WEBHOOK"
```
Strength: free, native (SingPost uses Google Workspace)
Weakness: no acknowledge/silence as in PagerDuty, informational only
Subtask 1.14-1.18: Testing workflow per repo (App Tracing, App Ingestion, Infra, SFTP, GitOps), 4h each ✅
Each repo has different characteristics:
- App Tracing/Ingestion: Node.js → Vitest unit tests, Supertest integration tests
- Infra (Terraform): terraform validate, plan, fmt check
- SFTP: may be a Python or a Node service; custom tests for file transfer
- GitOps: validate the Helm chart, yamllint, kustomize build
Generic solution:
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit -- --coverage
      - run: npm run test:integration
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/
```
Example breakdown of the 4h per repo:
- 1h: set up the template
- 1h: test the mock workflow (intentional fail scenarios)
- 1h: integrate with existing tests
- 1h: docs / open the PR
Total estimation: 40h (~5 man-days). Goal: an SCA gate on PRs that blocks dependency CVEs and license issues.
What it is: the reusable workflow file `reusable-workflow-dependency-review.yml` in the `singpost-cicd-components` repo.
Solution:
```yaml
# reusable-workflow-dependency-review.yml
name: Dependency Review (Reusable)
on:
  workflow_call:
    inputs:
      config-file:
        required: false
        type: string
        default: './.github/dependency-review-config.yml'
permissions:
  contents: read
  pull-requests: write
jobs:
  dependency-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/dependency-review-action@v4
        with:
          config-file: ${{ inputs.config-file }}
          comment-summary-in-pr: always
```
Strength: native GitHub action, deeply integrated with the Security tab
Weakness: requires GitHub Advanced Security (paid add-on) for private repos. SingPost is still on the Free plan → BLOCKER (see 2.5).
What it is: the file `dependency-review-config.yml`, which defines the scan rules.
Sample solution:
```yaml
# .github/dependency-review-config.yml
fail-on-severity: high            # block HIGH/CRITICAL CVEs
fail-on-scopes: [runtime]         # check runtime deps only (skip dev)
license-check: true
allow-licenses:                   # allowlist
  - MIT
  - Apache-2.0
  - BSD-3-Clause
  - ISC
  - BSD-2-Clause
# The action rejects a config that sets both allow-licenses and deny-licenses.
# With the allowlist above, GPL-3.0 and AGPL-3.0 are already rejected.
# deny-licenses:                  # denylist (use instead of allow-licenses)
#   - GPL-3.0
#   - AGPL-3.0
allow-dependencies-licenses:      # exception for a single dependency
  - 'pkg:npm/some-gpl-pkg@1.0.0'  # only if an exemption is needed
comment-summary-in-pr: always
warn-only: false
show-openssf-scorecard: true
warn-on-openssf-scorecard-level: 3 # warn if score < 3
show-patched-versions: true
```
Note from the task: "Currently only provide sample config for CVSS, OSI license guidelines, and OpenSSF". In other words, the team only has a sample so far; the client has not yet approved the final rules.
Example trigger: a PR adds `axios@0.21.0` (affected by CVE-2021-3749, HIGH severity) → the action fails → it comments on the PR:
```text
❌ axios@0.21.0 has 1 HIGH severity vulnerability:
- CVE-2021-3749: Regular expression denial of service
- Patched in: 0.21.2+
```
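How a `fail-on-severity` threshold turns findings into a pass/fail decision can be sketched as below. This models the gate only and is not the action's code: the severity names follow GitHub's advisory levels (low, moderate, high, critical), while the finding dictionaries, the `left-pad` entry, and the function name `blocking_findings` are hypothetical.

```python
from typing import Dict, List

# GitHub advisory severities, from least to most severe.
SEVERITY_ORDER = ["low", "moderate", "high", "critical"]


def blocking_findings(findings: List[Dict[str, str]],
                      fail_on_severity: str) -> List[Dict[str, str]]:
    """Return the findings at or above the configured severity threshold."""
    threshold = SEVERITY_ORDER.index(fail_on_severity)
    return [f for f in findings
            if SEVERITY_ORDER.index(f["severity"]) >= threshold]


findings = [
    {"package": "axios@0.21.0", "cve": "CVE-2021-3749", "severity": "high"},
    {"package": "left-pad@1.0.0", "cve": "N/A", "severity": "moderate"},
]
blocked = blocking_findings(findings, "high")
print("FAIL" if blocked else "PASS", [f["package"] for f in blocked])
```

Lowering the threshold to `moderate` would block both findings, which is the kind of impact worth quantifying before the rules are tightened.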
What it is: Import the reusable workflow into each repo's caller template.yml.
Solution:
# In caller template.yml
jobs:
  dependency-review:
    if: github.event_name == 'pull_request' # run on PRs only
    uses: SINGAPORE-POST-LIMITED/singpost-cicd-components/.github/workflows/reusable-workflow-dependency-review.yml@main
    with:
      config-file: './.github/dependency-review-config.yml'
Important: if: github.event_name == 'pull_request' is needed because the Dependency Review action works on PRs by default (it compares the dependency diff between base and head).
Status: Waiting for the client to confirm the rules. Reference: https://github.com/actions/dependency-review-action.
Solution flow:
- Schedule a meeting with the client security team
- Confirm:
  - fail-on-severity level (HIGH? CRITICAL only?)
  - License whitelist/blacklist
  - OpenSSF Scorecard threshold
  - Exceptions list
- Update the config file, PR review
- Roll out
Example trade-off discussion with the client:
- "If we set fail-on-severity: high, ~30% of PRs may be blocked at first. Better to start with medium in warn-only mode for 2 weeks, then tighten."
BLOCKER:
- Client needs to upgrade GitHub Free → Team plan
- Client needs to buy the GitHub Code Security add-on (Advanced Security)
Reason: the Dependency Review Action requires an Advanced Security license on private repos. It is free on public repos, but all SingPost repos are private.
Pricing reference (to be verified):
- GitHub Team: $4/user/month
- GitHub Advanced Security: ~$49/user/month (committers only)
Next step: Wait for client procurement approval. In the meantime we can:
- Demo on a personal account / sandbox repo (done)
- Prepare an alternative: Snyk (commercial, $$) or OWASP Dependency-Check (OSS, fewer features)
Identical pattern: Same as 2.5 but for a different repo. 2.5 must be unblocked first.
Answer for the meeting if the boss asks "Why is Lim's progress slow?" → "Blocked by the client's GitHub plan upgrade & Code Security add-on. Blocker raised in week X. In the meantime, the template + config are ready and can be deployed as soon as we are unblocked."
Total estimation: ~62h. Purpose: SAST gate.
What it is: Enable CodeQL scanning in the GitHub repo Settings.
Solution:
- Repo → Settings → Code security and analysis → Code scanning → Set up CodeQL
- Or add the workflow file .github/workflows/codeql.yml:
name: CodeQL
on:
  push: { branches: [main, dev] }
  pull_request: { branches: [main] }
  schedule: [{ cron: '0 6 * * 1' }] # Weekly, Monday 06:00 UTC
jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      security-events: write
      contents: read
    strategy:
      matrix: { language: [javascript-typescript] }
    steps:
      - uses: actions/checkout@v4
      - uses: github/codeql-action/init@v3
        with:
          languages: ${{ matrix.language }}
          queries: security-extended # broader than the default suite
      - uses: github/codeql-action/analyze@v3
Phase 3 enhancement: Add a custom PII query pack (custom CodeQL queries that detect PII handling violations).
Strength: Native GH, no separate infra needed, results land in the Security tab UI
Weakness: Slow on large repos (10+ min/run), the QL query language (Datalog-like) is hard to learn, many false positives
BLOCKER same as Task 2: requires the GitHub Code Security add-on for private repos.
Tested on a personal account because the add-on is not yet enabled on the client repo.
Status INVALID: SONAR_TOKEN and SONAR_HOST_URL are missing. Cannot test without an instance.
Sample template (coded, INVALID because it could not be run yet):
name: SonarQube Scan (Reusable)
on:
  workflow_call:
    inputs:
      PROJECTID: { required: true, type: string }
      config-path: { required: false, type: string, default: '.' }
    secrets:
      SONAR_TOKEN: { required: true }
      SONAR_HOST_URL: { required: true }
jobs:
  sonar:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 } # full history needed for blame info
      - uses: sonarsource/sonarqube-scan-action@master # pin to a release tag or commit SHA before rollout
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
        with:
          projectBaseDir: ${{ inputs.config-path }}
          args: >
            -Dsonar.projectKey=${{ inputs.PROJECTID }}
            -Dsonar.qualitygate.wait=true
Likewise, it cannot be imported because there is no instance yet.
What it is: If the client has no SonarCloud account → self-host SonarQube.
Requirements (must have from client):
- VM/server (min 4 CPU, 8GB RAM, 50GB disk)
- SonarQube license, Developer Edition or higher (~$150/dev/year, to be verified: Sonar typically prices server editions by lines of code rather than per developer) → needed to scan PRs (Community edition does not analyze branches/PRs)
- Domain + SSL cert (sonarqube.singpost.com)
- Postgres DB for Sonar data persistence
Self-host solution steps:
- Provision VM (Compute Engine), open port 9000
- Install via Docker: docker run -d -p 9000:9000 sonarqube:9-developer
- Configure DB connection (Cloud SQL Postgres)
- Generate token: User → My Account → Security → Generate Token → save as SONAR_TOKEN
- Save SONAR_HOST_URL=https://sonarqube.singpost.com
- Inject secrets into the GitHub org/repo
Trade-off discussion:
- Self-host: One-time setup, no per-PR cost, custom rules possible
- SonarCloud (SaaS): Zero ops, but $$, and may not meet Singapore data residency
Answer for the meeting: "If the client does not respond about SonarQube credentials within 2 weeks, we should drop SonarQube because CodeQL + ESLint already cover 70% of the use cases. The estimation can be moved to another task."
Total estimation: 32h. Purpose: Detect & block secrets in code.
Solution (this file already exists in the SingPost repo):
name: Gitleaks Scan (Reusable)
on:
  workflow_call:
    inputs:
      config-path: { required: false, type: string }
      scan_mode: { required: false, type: string, default: "pr" } # 'pr' = diff only, 'full' = all history
      base-ref: { required: false, type: string, default: "origin/main" }
      fail-on-detection: { required: false, type: boolean, default: true }
    secrets:
      GITLEAKS_LICENSE: { required: false }
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 } # full history needed
      # Checkout cicd-components repo to get default config
      - uses: actions/checkout@v4
        with:
          repository: SINGAPORE-POST-LIMITED/singpost-cicd-components
          path: reusable
          fetch-depth: 1
      - name: Resolve config
        id: cfg
        run: |
          if [ -n "${{ inputs.config-path }}" ]; then
            echo "config=${{ inputs.config-path }}" >> $GITHUB_OUTPUT
          elif [ -f .gitleaks.toml ]; then
            echo "config=.gitleaks.toml" >> $GITHUB_OUTPUT
          else
            echo "config=reusable/.github/config/.gitleaks.toml" >> $GITHUB_OUTPUT
          fi
      - name: Install Gitleaks
        run: |
          curl -sL https://github.com/gitleaks/gitleaks/releases/download/v8.18.0/gitleaks_8.18.0_linux_x64.tar.gz | tar xz
          sudo mv gitleaks /usr/local/bin/
      - name: Run Gitleaks
        run: |
          if [ "${{ inputs.scan_mode }}" = "pr" ]; then
            gitleaks detect --config=${{ steps.cfg.outputs.config }} \
              --log-opts="${{ inputs.base-ref }}..HEAD" \
              --report-format sarif --report-path gitleaks.sarif
          else
            gitleaks detect --config=${{ steps.cfg.outputs.config }} \
              --report-format sarif --report-path gitleaks.sarif
          fi
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with: { sarif_file: gitleaks.sarif }
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: gitleaks-report, path: gitleaks.sarif }
Key concepts:
- fetch-depth: 0: full git history is needed to scan all commits (the default fetch is a shallow clone of depth 1)
- scan_mode: pr vs full: PR mode scans the diff (fast), full mode scans everything (slow, baseline scan)
- Config priority: input → repo-local .gitleaks.toml → fallback default from singpost-cicd-components
- Upload SARIF → results appear in the GitHub Security tab
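The config priority can be sketched outside the workflow as well. The Python below is a minimal illustration that mirrors the "Resolve config" step, useful for reasoning about which file wins; `resolve_config` is a hypothetical helper, not part of Gitleaks or the SingPost repo.

```python
from pathlib import Path
from typing import Optional

# Fallback path used by the workflow when nothing else is provided.
DEFAULT_CONFIG = "reusable/.github/config/.gitleaks.toml"


def resolve_config(input_path: Optional[str], repo_root: Path) -> str:
    """Pick the Gitleaks config: explicit input, then repo-local file, then shared default."""
    if input_path:  # an empty string counts as "not provided", like [ -n ... ] in the shell step
        return input_path
    if (repo_root / ".gitleaks.toml").is_file():
        return ".gitleaks.toml"
    return DEFAULT_CONFIG


if __name__ == "__main__":
    print(resolve_config(None, Path(".")))
```

Keeping the shared default last lets a repo opt in to stricter or looser rules without touching the reusable workflow.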
What it is: Custom rules on top of the defaults. Reference: https://github.com/gitleaks/gitleaks/blob/master/config/gitleaks.toml
Sample:
# .gitleaks.toml
[extend]
useDefault = true # extend default rules
[[rules]]
id = "singpost-internal-token"
description = "SingPost Internal Service Token"
regex = '''SP_INT_TOKEN_[A-Z0-9]{32}'''
tags = ["singpost", "internal"]
keywords = ["SP_INT_TOKEN"]
[[rules]]
id = "apigee-mgmt-key"
description = "Apigee Management API Key"
regex = '''[a-z0-9]{32}\.apigee\.net'''
tags = ["apigee"]
[allowlist]
description = "Test fixtures"
paths = [
'''tests/fixtures/.*\.json''',
'''docs/examples/.*'''
]
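regexes = [
'''AKIAIOSFODNN7EXAMPLE''' # the fake AWS key used in docs; allowlisting the generic AKIA[A-Z0-9]{16} pattern would hide real keys
]
Example trigger: a dev pastes SP_INT_TOKEN_ABC123... into the code → Gitleaks fails the pipeline with: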
Finding: SP_INT_TOKEN_ABC123...
RuleID: singpost-internal-token
File: src/api/client.ts:42
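Before committing a custom rule, its regex can be checked locally. The snippet below is a small sketch using Python's `re` module with the same pattern as the `singpost-internal-token` rule; Gitleaks itself uses Go's regex engine, so this only confirms the pattern's intent, and `find_tokens` is a hypothetical helper.

```python
import re

# Same pattern as the singpost-internal-token rule in the sample .gitleaks.toml.
SP_INT_TOKEN = re.compile(r"SP_INT_TOKEN_[A-Z0-9]{32}")


def find_tokens(text: str) -> list[str]:
    """Return every substring that the custom rule would flag."""
    return SP_INT_TOKEN.findall(text)


if __name__ == "__main__":
    fake = "SP_INT_TOKEN_" + "A1" * 16  # 32 characters after the prefix
    print(find_tokens(f'const token = "{fake}";'))
    print(find_tokens("SP_INT_TOKEN_short"))  # too short, so no match
```

The rule is case-sensitive and requires exactly 32 uppercase alphanumeric characters after the prefix, so truncated or lowercase values are not flagged.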
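What it is: Negative test: create a branch that deliberately contains a secret to verify that Gitleaks detects it.
Solution:
git checkout -b test/gitleaks-detection
# Fake token matching the custom singpost-internal-token rule (the AWS docs example key is covered by the allowlist above, so it would not trigger)
echo "SP_TOKEN=SP_INT_TOKEN_0123456789ABCDEF0123456789ABCDEF" > .env.test
git add .env.test && git commit -m "test: fake secret"
git push origin test/gitleaks-detection
# Open PR → expect Gitleaks to fail
Status timeline: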
- ✅ App Tracing: Done
- ⏳ App Ingestion: In-Progress
- 🆕 SFTP: New
- 🆕 DSB Infra: New
- 🆕 GitOps: New
Solution per repo:
- Add the caller workflow file (template.yml) to the target repo
- Configure path triggers to suit the repo (the Infra repo additionally has terraform/**)
- Run on a feature branch first
- Open PR → verify the workflow runs → merge
- Document in the repo README
Strength: Later repos go faster because the pattern repeats
Weakness: Each repo has its own characteristics (Infra repo has *.tf, GitOps repo has .yaml k8s manifests) → rules may need tuning
Total estimation: 80h (10 man-days). Purpose: IaC provisioning for 5 environments.
What it is: Define the structure of the singpost-infra repo. Several choices: monorepo vs multi-repo, state per env vs global, tag strategy.
Solution decisions:
singpost-infra/
├── modules/ # Reusable modules (input → output)
│ ├── gke-cluster/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── README.md
│ ├── networking/ # VPC, subnets, firewall
│ ├── cloud-sql/
│ ├── artifact-registry/
│ └── cloud-run/
├── live/ # Per-env deployments
│ ├── _envcommon/ # Shared config
│ ├── dev/
│ │ ├── env.hcl # Dev-specific vars
│ │ ├── gcp/
│ │ │ ├── networking/terragrunt.hcl
│ │ │ ├── gke/terragrunt.hcl
│ │ │ └── cloud-sql/terragrunt.hcl
│ ├── sit/
│ ├── uat/
│ ├── preprod/
│ └── prod/
├── policies/ # OPA/Conftest policies
├── .github/workflows/
│ ├── plan.yml
│ └── apply.yml
├── terragrunt.hcl # Root config (remote state, providers)
└── CODEOWNERS
State management:
- Backend: GCS bucket singpost-tfstate per env (or per project)
- Versioning enabled on the bucket (roll back state if it gets corrupted)
- State locking: the gcs backend locks state natively via a lock object in the bucket, so no separate lock service is needed
Tagging strategy (labels on resources):
locals {
  common_tags = {
    environment = local.environment # dev/sit/uat/preprod/prod
    project = "singpost-logistics"
    owner = "platform-team"
    cost-center = "plat-001" # GCP label values must be lowercase
    managed-by = "terraform"
    git-sha = local.git_sha
  }
}
Compare Azure: Similar to Azure Resource Manager + tags on the resource group + Terraform azurerm backend on a storage account.
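GCP rejects labels that contain uppercase characters, so a value such as PLAT-001 would fail at apply time. A pre-check like the sketch below can catch this in CI; it covers only the ASCII subset of GCP's label rules (an assumption made to keep the example short), and `invalid_labels` is a hypothetical helper.

```python
import re

# ASCII subset of the GCP label rules: keys start with a lowercase letter,
# keys and values use lowercase letters, digits, underscores, and dashes (max 63 chars).
KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")


def invalid_labels(labels: dict[str, str]) -> list[str]:
    """Return the keys whose key or value breaks the label rules."""
    bad = []
    for key, value in labels.items():
        if not KEY_RE.fullmatch(key) or not VALUE_RE.fullmatch(value):
            bad.append(key)
    return bad


if __name__ == "__main__":
    labels = {"environment": "dev", "cost-center": "PLAT-001", "managed-by": "terraform"}
    print(invalid_labels(labels))
```

Running such a check in the plan job surfaces the problem in the PR instead of during apply.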
Strength:
- Clear separation per env
- Terragrunt DRY (define a module once, use it in many envs with different vars)
- State per env = small blast radius (a mistake in dev does not take down prod)
Weakness:
- Cross-env dependencies are complex (e.g. dev VPC depends on prod DNS zone)
- Terragrunt is an extra layer with its own learning curve
- State file management for multi-cloud (GCP + Azure dual-cloud per ADR-005) is complex
What it is: Write reusable Terraform modules.
Module GKE example:
# modules/gke-cluster/main.tf
resource "google_container_cluster" "primary" {
name = var.cluster_name
location = var.region
enable_autopilot = true # Phase M uses Autopilot
network = var.vpc_self_link
subnetwork = var.subnet_self_link
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
master_authorized_networks_config {
dynamic "cidr_blocks" {
for_each = var.authorized_networks
content {
cidr_block = cidr_blocks.value.cidr
display_name = cidr_blocks.value.name
}
}
}
resource_labels = var.labels
deletion_protection = var.environment == "prod"
}
# variables.tf
# Single-line blocks may hold only one argument, so multi-argument blocks are split across lines.
variable "cluster_name" { type = string }
variable "region" {
  type = string
  default = "asia-southeast1"
}
variable "vpc_self_link" { type = string }
variable "subnet_self_link" { type = string }
variable "authorized_networks" { type = list(object({ cidr = string, name = string })) }
variable "labels" {
  type = map(string)
  default = {}
}
variable "environment" { type = string }
variable "project_id" { type = string }
# outputs.tf
output "cluster_endpoint" {
  value = google_container_cluster.primary.endpoint
  sensitive = true
}
output "cluster_ca_certificate" {
  value = google_container_cluster.primary.master_auth[0].cluster_ca_certificate
  sensitive = true
}
Module GAR example:
resource "google_artifact_registry_repository" "main" {
location = var.region
repository_id = var.repo_id
format = "DOCKER"
description = var.description
labels = var.labels
cleanup_policies {
id = "keep-recent-tagged"
action = "KEEP"
most_recent_versions { keep_count = 10 }
}
cleanup_policies {
id = "delete-old-untagged"
action = "DELETE"
condition {
  tag_state = "UNTAGGED"
  older_than = "604800s" # 7 days
}
}
}
Module DB (Cloud SQL Postgres) example:
resource "google_sql_database_instance" "main" {
name = "${var.environment}-${var.db_name}"
database_version = "POSTGRES_15"
region = var.region
settings {
tier = var.environment == "prod" ? "db-custom-4-15360" : "db-custom-2-7680"
availability_type = var.environment == "prod" ? "REGIONAL" : "ZONAL"
disk_size = var.environment == "prod" ? 500 : 100
disk_autoresize = true
backup_configuration {
enabled = true
point_in_time_recovery_enabled = var.environment == "prod"
start_time = "02:00"
transaction_log_retention_days = var.environment == "prod" ? 7 : 1
}
ip_configuration {
ipv4_enabled = false # Private only
private_network = var.vpc_self_link
}
insights_config {
query_insights_enabled = true
}
}
deletion_protection = var.environment == "prod"
}
Note from task: "Still depend on the Application Architecture -> the effort can be changed." → This means that if the architecture changes (e.g. a switch from GKE → Cloud Run), the modules must be refactored.
What it is: Apply the Terragrunt wrapper to keep module calls DRY.
Solution:
# terragrunt.hcl (root)
remote_state {
backend = "gcs"
generate = {
path = "backend.tf"
if_exists = "overwrite_terragrunt"
}
config = {
bucket = "singpost-tfstate-${local.env}"
prefix = "${path_relative_to_include()}/terraform.tfstate"
project = "singpost-platform"
location = "asia-southeast1"
}
}
# Auto-generate provider block per env
generate "provider" {
path = "provider.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
provider "google" {
project = "${local.project_id}"
region = "${local.region}"
}
EOF
}
locals {
env_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))
env = local.env_vars.locals.environment
project_id = local.env_vars.locals.project_id
region = local.env_vars.locals.region
}
# live/dev/env.hcl
locals {
environment = "dev"
project_id = "singpost-dev-12345"
region = "asia-southeast1"
network_cidr = "10.10.0.0/16"
}
# live/dev/gcp/gke/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "../../../../modules/gke-cluster"
}
dependency "networking" {
config_path = "../networking"
mock_outputs = { # for plan-time when networking not yet applied
vpc_self_link = "mock-vpc"
subnet_self_link = "mock-subnet"
}
}
inputs = {
cluster_name = "singpost-dev-gke"
vpc_self_link = dependency.networking.outputs.vpc_self_link
subnet_self_link = dependency.networking.outputs.subnet_self_link
environment = "dev"
project_id = "singpost-dev-12345"
authorized_networks = [
{ cidr = "10.0.0.0/8", name = "internal" }
]
}
Key concepts:
- include "root" = inherit the root terragrunt.hcl config
- dependency = reference outputs from another module (Terragrunt handles the ordering)
- mock_outputs = fake outputs for plan when the dependency has not been applied yet
- read_terragrunt_config = read shared config from env.hcl
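To see where each unit's state ends up, the bucket and prefix from the root config can be reproduced by hand. The sketch below is an approximation in Python: it assumes `path_relative_to_include()` resolves to the unit directory relative to the directory holding the root terragrunt.hcl, and `state_location` is a hypothetical helper.

```python
import posixpath


def state_location(root_dir: str, unit_dir: str, env: str) -> tuple[str, str]:
    """Return (bucket, object prefix) following the root remote_state block."""
    # Assumption: path_relative_to_include() is the unit path relative to the root config dir.
    rel = posixpath.relpath(unit_dir, start=root_dir)
    bucket = f"singpost-tfstate-{env}"
    prefix = f"{rel}/terraform.tfstate"
    return bucket, prefix


if __name__ == "__main__":
    print(state_location("/repo", "/repo/live/dev/gcp/gke", "dev"))
```

Because the prefix is derived from the directory path, moving a unit to another folder changes its state location, so such moves need a state migration.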
Note from task: "Need Infra Architecture to update variable value (network info, server sizing, firewall rules,...)": this depends on the Architecture team providing the real spec.
Solution flow per env:
- Update live/{env}/env.hcl with the spec from the architect
- Update network info (CIDR, peering)
- Update server sizing per env (e.g. dev = small, prod = large)
- Update firewall rules
- terragrunt run-all plan to preview
- Review plan output in the PR
- Merge → terragrunt run-all apply
Example env-specific sizing:
| Resource | Dev | SIT | UAT | Pre-Prod | Prod |
|---|---|---|---|---|---|
| GKE node pool | n1-standard-2 × 1 | n1-standard-2 × 2 | n1-standard-4 × 2 | n1-standard-4 × 3 | n1-standard-8 × 3-10 |
| Cloud SQL | db-f1-micro | db-custom-1-3840 | db-custom-2-7680 | db-custom-2-7680 HA | db-custom-4-15360 HA |
| Backup | Daily 1 day | Daily 3 days | Daily 7 days | Daily 7 days | Daily 30 days + PITR |
What it is: WIF setup for Terraform.
Prerequisites from client:
- Create Service Account terraform-sa@singpost-platform.iam.gserviceaccount.com
- Grant roles:
  - roles/editor (provision resources)
  - roles/resourcemanager.projectIamAdmin (manage IAM)
  - roles/iam.serviceAccountAdmin (create SAs for new resources)
- Create GCS bucket singpost-tfstate with Object Versioning enabled
- Grant roles/storage.objectAdmin to terraform-sa
- GCP Console access for the team: roles/viewer + roles/browser
WIF setup solution:
# 1. Create Workload Identity Pool
gcloud iam workload-identity-pools create "github-pool" \
--location="global" \
--display-name="GitHub Pool"
# 2. Create Provider (trust GitHub OIDC)
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --location="global" \
  --workload-identity-pool="github-pool" \
  --display-name="GitHub Provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository_owner == 'SINGAPORE-POST-LIMITED'" \
  --issuer-uri="https://token.actions.githubusercontent.com"
# The attribute condition restricts the provider to tokens issued for this GitHub org.
# 3. Allow GitHub repo to impersonate terraform-sa
gcloud iam service-accounts add-iam-policy-binding terraform-sa@PROJECT.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/github-pool/attribute.repository/SINGAPORE-POST-LIMITED/singpost-infra"
Strength: Keyless, secure, auditable
Weakness: Complex first-time setup, hard to debug (Permission Denied errors are not descriptive)
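A frequent source of those Permission Denied errors is a typo in the principalSet member string. A tiny helper like the sketch below builds it from its parts so each repo binding is generated the same way; `repo_principal_set` is a hypothetical name, and the project number in the example is a placeholder.

```python
def repo_principal_set(project_number: str, pool_id: str, repository: str) -> str:
    """Build the IAM member that matches every token issued for one GitHub repository."""
    return (
        f"principalSet://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/attribute.repository/{repository}"
    )


if __name__ == "__main__":
    # "123456789012" is a placeholder project number, not a real SingPost value.
    print(repo_principal_set("123456789012", "github-pool", "SINGAPORE-POST-LIMITED/singpost-infra"))
```

Note that the member uses the numeric project number, not the project ID, and that the repository must be written as owner/name exactly as GitHub reports it.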
What it is: Integrate Terraform into a GitHub Actions workflow.
Solution:
# .github/workflows/terragrunt.yml
name: Terragrunt
on:
  pull_request:
    paths: ['modules/**', 'live/**']
  push:
    branches: [main]
    paths: ['modules/**', 'live/**']
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      envs: ${{ steps.detect.outputs.envs }}
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - id: detect
        run: |
          # Detect changed env directories; jq -c keeps the JSON on one line for GITHUB_OUTPUT
          CHANGED=$(git diff --name-only HEAD~1 HEAD | { grep '^live/' || true; } | awk -F/ '{print $2}' | sort -u | jq -R . | jq -sc .)
          echo "envs=$CHANGED" >> $GITHUB_OUTPUT
  plan:
    needs: detect-changes
    # Skip when no env changed: an empty matrix would fail the job
    if: github.event_name == 'pull_request' && needs.detect-changes.outputs.envs != '[]'
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read, pull-requests: write }
    strategy:
      matrix:
        env: ${{ fromJson(needs.detect-changes.outputs.envs) }}
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: terraform-sa@singpost-platform.iam.gserviceaccount.com
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_wrapper: false } # the wrapper script can interfere with Terragrunt
      - name: Install Terragrunt # setup-terraform does not ship it; the version here is an assumption, pin your own
        run: |
          curl -sSL -o terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.55.0/terragrunt_linux_amd64
          chmod +x terragrunt && sudo mv terragrunt /usr/local/bin/
      - run: |
          set -o pipefail
          cd live/${{ matrix.env }}
          terragrunt run-all init --terragrunt-non-interactive
          # Capture the human-readable plan so the next step can post it
          terragrunt run-all plan --terragrunt-non-interactive -no-color 2>&1 | tee "$GITHUB_WORKSPACE/plan.txt"
      - uses: actions/github-script@v7
        with:
          script: |
            // Post plan output as PR comment
            const fs = require('fs');
            const plan = fs.readFileSync('plan.txt', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `### Terragrunt Plan for ${{ matrix.env }}\n\`\`\`\n${plan}\n\`\`\``
            });
  apply:
    needs: [detect-changes]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main' && needs.detect-changes.outputs.envs != '[]'
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read }
    strategy:
      matrix:
        env: ${{ fromJson(needs.detect-changes.outputs.envs) }}
      max-parallel: 1 # Apply sequentially
    environment: ${{ matrix.env }} # GitHub Environment per env for approval
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: terraform-sa@singpost-platform.iam.gserviceaccount.com
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_wrapper: false }
      - name: Install Terragrunt # same assumed version as in the plan job
        run: |
          curl -sSL -o terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.55.0/terragrunt_linux_amd64
          chmod +x terragrunt && sudo mv terragrunt /usr/local/bin/
      - run: |
          cd live/${{ matrix.env }}
          terragrunt run-all apply --terragrunt-non-interactive -auto-approve
Key concepts:
- detect-changes job: scopes the blast radius, only runs for envs that changed
- environment: ${{ matrix.env }}: GitHub Environment with required reviewers for preprod/prod
- max-parallel: 1: prevents state conflicts
- id-token: write: required for WIF OIDC
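The change detection can be prototyped and unit-tested outside the workflow. The Python below is a rough equivalent of the shell pipeline in the detect-changes job, with two deliberate differences stated as assumptions: files directly under live/ are ignored, and folders starting with an underscore (such as _envcommon) are not treated as environments. `changed_envs` is a hypothetical helper.

```python
import json


def changed_envs(changed_files: list[str]) -> str:
    """Return a compact JSON array of env names touched under live/."""
    envs = set()
    for path in changed_files:
        parts = path.split("/")
        # Expect live/<env>/...; skip shared folders such as _envcommon (assumption).
        if len(parts) >= 3 and parts[0] == "live" and not parts[1].startswith("_"):
            envs.add(parts[1])
    # Compact separators keep the value on one line, as GITHUB_OUTPUT requires.
    return json.dumps(sorted(envs), separators=(",", ":"))


if __name__ == "__main__":
    files = [
        "live/dev/gcp/gke/terragrunt.hcl",
        "live/prod/env.hcl",
        "modules/gke-cluster/main.tf",
    ]
    print(changed_envs(files))
```

Changes under modules/ map to no environment here, which matches the shell version; covering them would need an extra rule that maps a module to every env using it.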
What it is: End-to-end testing of the whole pipeline.
Test scenarios:
- PR happy path: change live/dev/gcp/networking/ → plan comment appears → merge → apply succeeds
- PR breaking change: backwards-incompatible module change → plan shows DESTROY → requires manual review
- Multi-env PR: a change to modules/gke/ affects 5 envs → 5 plans in parallel
- Apply conflict: 2 PRs touch the same state → the second fails with a lock error
- WIF auth fail: SA missing role → clear error message
- Drift detection: manual change in the console → next plan shows drift
Strength: Comprehensive coverage
Weakness: Needs a real GCP project to test, which costs budget
Total estimation: ~80h. Purpose: Build container images for 4 app repos.
This task is split into 2 sub-tools: Cloud Build (for GCP projects) and Docker build on a GH Actions runner (alternative for non-GCP).
Prerequisites from client:
- GCP project access
- Permission to enable APIs: run.googleapis.com, cloudbuild.googleapis.com, artifactregistry.googleapis.com, iamcredentials.googleapis.com, sts.googleapis.com
- Permission to create Service Accounts
- IAM admin access
- WIF configuration access
Solution:
# Terraform to create Cloud Run + runtime SA
resource "google_service_account" "cloud_run_runtime" {
account_id = "cloud-run-runtime-${var.env}"
display_name = "Cloud Run Runtime SA - ${var.env}"
}
resource "google_cloud_run_v2_service" "default" {
name = "app-${var.env}"
location = "asia-southeast1"
template {
service_account = google_service_account.cloud_run_runtime.email
vpc_access {
egress = "PRIVATE_RANGES_ONLY"
network_interfaces {
network = var.vpc_id
subnetwork = var.subnet_id
}
}
containers {
image = "us-docker.pkg.dev/PROJECT/REPO/app:placeholder" # Cloud Deploy will swap
resources {
limits = { cpu = "1", memory = "512Mi" }
}
}
}
# Cloud Deploy will manage traffic
lifecycle {
ignore_changes = [template[0].containers[0].image, traffic]
}
}
What it is: A separate SA for each job step (principle of least privilege).
Solution:
# Build SA - only push to GAR
gcloud iam service-accounts create build-sa --display-name "Build SA"
gcloud projects add-iam-policy-binding PROJECT \
--member="serviceAccount:build-sa@PROJECT.iam.gserviceaccount.com" \
--role="roles/artifactregistry.writer"
gcloud projects add-iam-policy-binding PROJECT \
--member="serviceAccount:build-sa@PROJECT.iam.gserviceaccount.com" \
--role="roles/cloudbuild.builds.editor"
gcloud projects add-iam-policy-binding PROJECT \
--member="serviceAccount:build-sa@PROJECT.iam.gserviceaccount.com" \
--role="roles/logging.logWriter"
# Deploy SA - trigger Cloud Deploy
gcloud iam service-accounts create deploy-sa --display-name "Deploy SA"
gcloud projects add-iam-policy-binding PROJECT \
--member="serviceAccount:deploy-sa@PROJECT.iam.gserviceaccount.com" \
--role="roles/clouddeploy.releaser"
gcloud projects add-iam-policy-binding PROJECT \
--member="serviceAccount:deploy-sa@PROJECT.iam.gserviceaccount.com" \
--role="roles/clouddeploy.jobRunner"
gcloud projects add-iam-policy-binding PROJECT \
--member="serviceAccount:deploy-sa@PROJECT.iam.gserviceaccount.com" \
--role="roles/run.admin"
# Deploy SA needs ServiceAccountUser on the runtime SA (to attach it to the Cloud Run service)
gcloud iam service-accounts add-iam-policy-binding cloud-run-runtime-dev@PROJECT.iam.gserviceaccount.com \
  --member="serviceAccount:deploy-sa@PROJECT.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"
Key concept: roles/iam.serviceAccountUser is required for the deploy SA to "attach" the runtime SA to the Cloud Run service. Forgetting it produces a Permission denied error that is hard to debug.
Identical pattern with Task 5.5 (Terraform). Reuse the same pool; create a separate provider if needed.
Solution:
- GitHub repo Settings → Secrets and variables → Actions
  - WIF_PROVIDER: projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/github-pool/providers/github-provider
  - BUILD_SA_EMAIL: build-sa@PROJECT.iam.gserviceaccount.com
  - GCP_PROJECT_ID: singpost-platform-12345
  - GAR_REGION: asia-southeast1
- Variables (non-sensitive):
  - GAR_REPO_NAME: app-images
  - DEFAULT_REGION: asia-southeast1
- Workflow permissions: Settings → Actions → General → Workflow permissions → "Read and write permissions"
What it is: Optimize the Dockerfile for both local + CI.
Dockerfile solution (multi-stage, distroless):
# Stage 1: Builder
FROM node:20-alpine AS builder
WORKDIR /app
# Cache deps layer
COPY package*.json ./
# Install all deps: the build step needs devDependencies (e.g. a TypeScript compiler)
RUN npm ci
# Build app
COPY . .
RUN npm run build
# Drop devDependencies before node_modules is copied to the runtime image
RUN npm prune --omit=dev
# Stage 2: Runtime (distroless)
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package.json ./
USER nonroot:nonroot
EXPOSE 8080
CMD ["dist/server.js"]
Strength:
- Multi-stage: small final image (~150MB instead of 1GB)
- Distroless: no shell/package manager → smaller attack surface
- npm prune --omit=dev: removes devDependencies (jest, eslint...) from the runtime image
- Layer cache: if only code changes, the deps layer is reused (faster build)
Local compat check:
# Build local
docker build -t app:local .
# Run local
docker run -p 8080:8080 app:local
# Test
curl http://localhost:8080/health
Solution:
# cloudbuild.yaml
steps:
  # Step 1: Build image
  - name: gcr.io/cloud-builders/docker
    args:
      - build
      - --tag=$_AR_HOSTNAME/$_REPO/$_IMAGE:$COMMIT_SHA
      - --tag=$_AR_HOSTNAME/$_REPO/$_IMAGE:latest
      - --cache-from=$_AR_HOSTNAME/$_REPO/$_IMAGE:latest
      - .
  # Step 2: Push image
  - name: gcr.io/cloud-builders/docker
    args: ['push', '--all-tags', '$_AR_HOSTNAME/$_REPO/$_IMAGE']
  # Step 3: Container Analysis scan (Phase 1+)
  - name: gcr.io/cloud-builders/gcloud
    entrypoint: bash
    args:
      - -c
      - |
        gcloud artifacts docker images scan $_AR_HOSTNAME/$_REPO/$_IMAGE:$COMMIT_SHA \
          --format='value(response.scan)'
images:
  - $_AR_HOSTNAME/$_REPO/$_IMAGE
substitutions:
  _AR_HOSTNAME: asia-southeast1-docker.pkg.dev
  _REPO: PROJECT_ID/app-images
  _IMAGE: ${REPO_NAME}
options:
  dynamicSubstitutions: true # lets _IMAGE reference ${REPO_NAME} in manually submitted builds
  logging: CLOUD_LOGGING_ONLY
  machineType: E2_HIGHCPU_8 # faster builds, more $$
Solution:
# .github/workflows/build.yml
name: Build (Cloud Build)
on:
  push:
    branches: [main]
    paths: ['src/**', 'Dockerfile', 'package*.json', 'cloudbuild.yaml']
jobs:
  build:
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read }
    env:
      PROJECT: ${{ secrets.GCP_PROJECT_ID }} # used in the --service-account path below
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.BUILD_SA_EMAIL }}
      - uses: google-github-actions/setup-gcloud@v2
      - run: |
          gcloud builds submit \
            --config=cloudbuild.yaml \
            --substitutions=COMMIT_SHA=${{ github.sha }},REPO_NAME=${{ github.event.repository.name }} \
            --region=asia-southeast1 \
            --service-account=projects/$PROJECT/serviceAccounts/build-sa@$PROJECT.iam.gserviceaccount.com
Test plan:
- Push commit → check that the GH Action is triggered
- Action authenticates to GCP via WIF (check log "Successfully authenticated as build-sa")
- Cloud Build job submitted (check Cloud Console → Cloud Build → History)
- Image appears in GAR with tags git-sha-abc... and latest
- Container Analysis scan completes, vulnerability report viewable
- Total time < 5 min
Repeat pattern. App Ingestion + SFTP are shorter because the pattern is reused.
When to use: the project is not on GCP (Azure, AWS, on-prem), or a fast build is needed without going through the Cloud Build network.
What it is: Verify that Docker + buildx are available on the ubuntu-latest runner.
Solution check:
- run: |
    docker version # Docker installed
    docker buildx version # buildx for multi-platform
    docker info
Note: GitHub-hosted runners have Docker pre-installed. Self-hosted runners need a manual install.
Same as 6.4.
Same as 6.5.
Solution:
# .github/workflows/docker-build.yml
name: Docker Build (GH Runner)
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3 # emulation needed to build linux/arm64 on an amd64 runner
      - uses: docker/setup-buildx-action@v3
      # Login to GAR via WIF
      - id: auth # lets the login step read steps.auth.outputs.access_token
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.BUILD_SA_EMAIL }}
          token_format: 'access_token'
      - uses: docker/login-action@v3
        with:
          registry: asia-southeast1-docker.pkg.dev
          username: oauth2accesstoken
          password: ${{ steps.auth.outputs.access_token }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          platforms: linux/amd64,linux/arm64 # multi-arch
          tags: |
            asia-southeast1-docker.pkg.dev/PROJECT/REPO/app:${{ github.sha }}
            asia-southeast1-docker.pkg.dev/PROJECT/REPO/app:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
          provenance: true # SLSA provenance for supply chain
Strength of GH Runner build:
- Faster cold start (no Cloud Build VM provisioning)
- GH Actions cache (type=gha) reuses layers
- Multi-arch (amd64 + arm64) built-in
- SLSA provenance attestation (supply chain security)
Weakness:
- The image push travels over the public network from the GitHub runner to GAR (slow for large images)
- GH-hosted runner has 2 CPU, 7GB RAM (Cloud Build offers high-CPU machine types)
- Free tier 2000 min/month (charged $0.008/min after)
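The runner cost in the last bullet is simple arithmetic and can be estimated up front. The sketch below uses the figures quoted above (2000 free minutes, $0.008 per minute after) as defaults; both should be verified against the client's actual GitHub plan, and `monthly_overage_usd` is a hypothetical helper.

```python
def monthly_overage_usd(minutes_used: int, free_minutes: int = 2000,
                        rate_per_minute: float = 0.008) -> float:
    """Estimate the monthly charge for runner minutes beyond the free quota."""
    billable = max(0, minutes_used - free_minutes)
    return round(billable * rate_per_minute, 2)


if __name__ == "__main__":
    # Hypothetical load: 80 deployable units, 10 builds each per month, 5 minutes per build.
    minutes = 80 * 10 * 5
    print(minutes, monthly_overage_usd(minutes))
```

For that hypothetical load (4000 minutes) half of the usage is billable, so build frequency and duration matter more than the per-minute rate.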
What it is: Test failure scenarios → ensure a clear error in the workflow log.
Test scenarios:
- Network timeout: kill registry mid-push
- Auth fail: revoke SA permission
- Image too large: build 5GB image → ensure timeout/error
- Dockerfile syntax error: invalid RUN command
- Out of disk: fill runner disk → ensure cleanup
Total estimation: 70h. Purpose: Continuous Delivery for Cloud Run + GKE.
Prerequisites: Task 6 (Cloud Build) complete first. Need:
- Cloud Deploy IAM roles
- Cloud Run IAM roles
- WIF setup
- Artifact Registry create permission
Solution (overlaps with Task 8 GAR):
gcloud artifacts repositories create app-images \
  --repository-format=docker \
  --location=asia-southeast1 \
  --description="Application images for SingPost"
Single env baseline: Create a Cloud Run service skeleton (placeholder image, to be swapped by Cloud Deploy).
Multi-env (UAT, Preprod, Prod):
# Terraform per env
module "cloud_run_uat" {
  source = "./modules/cloud-run"
  env = "uat"
  project_id = "singpost-uat-67890"
  service_name = "app-uat"
  runtime_sa = google_service_account.runtime_uat.email
}
module "cloud_run_preprod" { ... }
module "cloud_run_prod" { ... }
Key consideration: Each env can be:
- Separate GCP project (recommended for prod isolation)
- Same project + different service name (lower envs)
What it is: Define the Cloud Deploy clouddeploy.yaml, i.e. the promotion flow.
Solution:
# clouddeploy/clouddeploy.yaml
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: app-pipeline
description: Pipeline for app from dev to prod
serialPipeline:
  stages:
    - targetId: dev
      profiles: [dev]
      strategy:
        standard:
          verify: true # Run smoke test post-deploy
    - targetId: uat
      profiles: [uat]
      strategy:
        standard:
          verify: true
    - targetId: preprod
      profiles: [preprod]
      strategy:
        canary:
          runtimeConfig:
            cloudRun:
              automaticTrafficControl: true
          canaryDeployment:
            percentages: [25, 50]
            verify: true
    - targetId: prod
      profiles: [prod]
      strategy:
        canary:
          runtimeConfig:
            cloudRun:
              automaticTrafficControl: true
          canaryDeployment:
            percentages: [5, 25, 50]
            verify: true
---
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: dev
description: Dev environment
run:
  location: projects/singpost-dev/locations/asia-southeast1
---
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: prod
description: Prod environment
requireApproval: true # MANUAL APPROVAL gate
run:
  location: projects/singpost-prod/locations/asia-southeast1
executionConfigs:
  - usages: [RENDER, DEPLOY, VERIFY]
    serviceAccount: deploy-sa-prod@singpost-prod.iam.gserviceaccount.com
Key concepts:
- serialPipeline: stages run sequentially (dev → uat → preprod → prod)
- profiles: reference the Skaffold profile used to render a different manifest per env
- strategy.canary: Phase 2+ progressive traffic
- requireApproval: true: prod needs a manual "Approve" click in Cloud Console
- executionConfigs: dedicated SA for prod (least privilege)
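To reason about how many traffic steps a rollout goes through, the canary percentages can be expanded into a full phase list. The sketch below assumes that a final stable phase at 100% follows the configured canary percentages (an assumption about Cloud Deploy's behavior that should be confirmed in its documentation); `rollout_phases` is a hypothetical helper.

```python
def rollout_phases(percentages: list[int]) -> list[int]:
    """Expand canary percentages into the full traffic sequence, ending at 100."""
    if any(not 0 < p < 100 for p in percentages):
        raise ValueError("canary percentages must be between 1 and 99")
    if percentages != sorted(set(percentages)):
        raise ValueError("canary percentages must be strictly increasing")
    # Assumption: a final stable phase at 100% follows the canary phases.
    return percentages + [100]


if __name__ == "__main__":
    print(rollout_phases([25, 50]))     # preprod stage in the pipeline above
    print(rollout_phases([5, 25, 50]))  # prod stage in the pipeline above
```

With verify enabled, each phase is one more smoke-test run, so the prod rollout has four checkpoints under this assumption.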
2-layer approval:
- GitHub Environment (caller layer): required reviewers approve the PR/deploy
- Cloud Deploy requireApproval: true (deployment layer): approve in GCP Console
Strength: Double check, clear audit trail
Weakness: High friction, devs complain "we have to approve twice?"
Trade-off discussion with the client: GitHub Environment can be disabled for dev/sit and kept only for preprod/prod.
Same pattern Task 6.2.
Same Task 5.5.
What it is: Extend the workflow to trigger Cloud Deploy after the build.
Solution:
```yaml
jobs:
  build:
    # ... build & push image ...
    outputs:
      image_uri: ${{ steps.push.outputs.image_uri }}
  release:
    needs: build
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.DEPLOY_SA_EMAIL }}
      - run: |
          gcloud deploy releases create release-${{ github.sha }} \
            --delivery-pipeline=app-pipeline \
            --region=asia-southeast1 \
            --images=app=${{ needs.build.outputs.image_uri }} \
            --description="Triggered by commit ${{ github.sha }}"
```
Include `clouddeploy.yaml` and `skaffold.yaml` (Cloud Deploy uses Skaffold to render manifests).
```yaml
# skaffold.yaml
apiVersion: skaffold/v4beta7
kind: Config
profiles:
  - name: dev
    deploy:
      cloudrun:
        projectid: singpost-dev
        region: asia-southeast1
    manifests:
      rawYaml: [manifests/dev/service.yaml]
  - name: prod
    deploy:
      cloudrun:
        projectid: singpost-prod
        region: asia-southeast1
    manifests:
      rawYaml: [manifests/prod/service.yaml]
```
Test plan:
- Trigger a release from the GitHub Actions workflow
- Verify that dev deploys automatically
- Smoke test passes
- Manually promote dev → uat (verify that approval works)
- Repeat for uat → preprod → prod
- Canary progression works: 25% → 50% → 100% on preprod, 5% → 25% → 50% → 100% on prod
- Auto-rollback triggers when the error rate spikes (inject errors manually)
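The last test-plan item depends on some rule for what counts as an error-rate spike. The sketch below shows one way a verify step could make that decision. Everything in it is an assumption for illustration: the function name, the 1% threshold, and the minimum of 100 requests are not taken from the project's configuration.

```python
def evaluate_canary(
    total_requests: int,
    failed_requests: int,
    max_error_rate: float = 0.01,  # hypothetical threshold: 1% errors
    min_requests: int = 100,       # hypothetical minimum sample size
) -> str:
    """Return "pass", "fail", or "insufficient-data" for one canary phase."""
    if total_requests < 0 or failed_requests < 0 or failed_requests > total_requests:
        raise ValueError("request counts are inconsistent")
    if total_requests < min_requests:
        # Too little traffic to judge; the caller decides whether to wait or abort.
        return "insufficient-data"
    error_rate = failed_requests / total_requests
    return "fail" if error_rate > max_error_rate else "pass"


if __name__ == "__main__":
    # Injected-error scenario from the test plan: 60 failures out of 2000 requests (3%).
    print(evaluate_canary(total_requests=2000, failed_requests=60))  # fail
```

A "fail" verdict would make the verify job exit non-zero, which fails the rollout phase. How that failure turns into an automatic rollback depends on the pipeline's automation setup.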
Total estimation: ~80h. Note: largely overlaps with Task 6/7, but GAR-focused.
Decisions:
- Naming: `{env}-{type}-{purpose}` (e.g. `prod-docker-apps`, `dev-helm-charts`)
- Region: `asia-southeast1` (Singapore data residency)
- Format: `DOCKER` for images, `DOCKER` (OCI mode) for Helm charts
- Retention: keep the 10 most recent tagged versions and delete untagged versions older than 7 days
- Cleanup policy: run on a weekly cron
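The retention decision combines a keep rule with a delete rule, and it helps to see how they interact. The sketch below simulates that interaction in plain Python. It makes two assumptions: a Keep rule wins over a Delete rule when both match, and the keep rule counts the most recent versions regardless of tag state. The `ImageVersion` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ImageVersion:
    digest: str
    tags: tuple[str, ...]
    age: timedelta  # time since the version was uploaded


def versions_to_delete(
    versions: list[ImageVersion],
    keep_count: int = 10,
    untagged_max_age: timedelta = timedelta(days=7),
) -> list[str]:
    """Return the digests that the two rules together would delete."""
    newest_first = sorted(versions, key=lambda v: v.age)
    kept = {v.digest for v in newest_first[:keep_count]}  # Keep wins over Delete
    return [
        v.digest
        for v in newest_first
        if v.digest not in kept and not v.tags and v.age > untagged_max_age
    ]


if __name__ == "__main__":
    # Hypothetical repository: 12 untagged builds (0..11 days old) and one old release.
    builds = [ImageVersion(f"sha256:build{d:02d}", (), timedelta(days=d)) for d in range(12)]
    release = ImageVersion("sha256:release", ("v1.0.0",), timedelta(days=90))
    print(versions_to_delete(builds + [release]))
    # ['sha256:build10', 'sha256:build11']
```

The 90-day-old tagged release survives because no delete rule targets tagged versions. Tagged images therefore accumulate until a further delete rule is added for them.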
Solution:
```shell
# Method 1: gcloud helper (recommended)
gcloud auth configure-docker asia-southeast1-docker.pkg.dev
# Method 2: Service account key (avoid - use WIF instead)
cat key.json | docker login -u _json_key --password-stdin \
  https://asia-southeast1-docker.pkg.dev
```
Strategy:
- Immutable tags: `git-sha-abc123` (always, for rollback reference)
- Semver: `v1.2.3` (on release)
- Environment: `dev-latest`, `prod-latest` (mutable, currently deployed)
- Rollback tag: `prod-rollback-v1.2.2` (manual tag for quick rollback)
Anti-pattern:
- ❌ `latest` for prod (ambiguous, cannot be rolled back)
- ❌ Reusing a tag (e.g. pushing `v1.2.3` again with different content)
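The tagging strategy and its anti-patterns can be checked mechanically before a push. The following is a minimal sketch of such a check. The regular expressions are inferred from the example tags above, the environment names are assumed to be `dev`, `sit`, `uat`, `preprod`, and `prod`, and the helper names are hypothetical.

```python
import re

ENVS = r"(dev|sit|uat|preprod|prod)"  # assumed environment short names

TAG_RULES = [
    ("immutable-sha", re.compile(r"git-sha-[0-9a-f]{6,40}")),
    ("semver", re.compile(r"v\d+\.\d+\.\d+")),
    ("rollback", re.compile(ENVS + r"-rollback-v\d+\.\d+\.\d+")),
    ("env-latest", re.compile(ENVS + r"-latest")),
]


def classify_tag(tag: str) -> str:
    """Return the tag kind, or raise if the tag violates the strategy."""
    for kind, pattern in TAG_RULES:
        if pattern.fullmatch(tag):
            return kind
    raise ValueError(f"tag {tag!r} does not follow the tagging strategy")


def may_overwrite(tag: str) -> bool:
    """Only environment pointers are mutable; every other tag is write-once."""
    return classify_tag(tag) == "env-latest"


if __name__ == "__main__":
    for tag in ["git-sha-abc123", "v1.2.3", "prod-latest", "prod-rollback-v1.2.2"]:
        print(tag, classify_tag(tag), may_overwrite(tag))
```

A bare `latest` matches none of the rules and is rejected. Re-pushing `v1.2.3` is refused by any caller that consults `may_overwrite` before overwriting an existing tag.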
```shell
gcloud artifacts repositories set-cleanup-policies app-images \
  --location=asia-southeast1 \
  --policy=cleanup-policy.json
```

```json
[
  {
    "name": "keep-latest-10",
    "action": {"type": "Keep"},
    "mostRecentVersions": {"keepCount": 10}
  },
  {
    "name": "delete-untagged-after-7d",
    "action": {"type": "Delete"},
    "condition": {
      "tagState": "untagged",
      "olderThan": "604800s"
    }
  }
]
```
Total estimation: ~60h.
Purpose: IaC security scan.
What it is: A reusable Checkov workflow plus custom rules.
Solution (reusable workflow):
```yaml
# reusable-workflow-checkov.yml
name: Checkov Scan
on:
  workflow_call:
    inputs:
      directory: { required: false, type: string, default: '.' }
      framework: { required: false, type: string, default: 'all' }  # terraform, kubernetes, helm, github_actions
      soft-fail: { required: false, type: boolean, default: false }
jobs:
  checkov:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4
      - uses: bridgecrewio/checkov-action@master
        with:
          directory: ${{ inputs.directory }}
          framework: ${{ inputs.framework }}
          output_format: sarif
          output_file_path: checkov.sarif
          soft_fail: ${{ inputs.soft-fail }}
          skip_check: CKV_AWS_999,CKV_GCP_TEST  # known false positive
          download_external_modules: true
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with: { sarif_file: checkov.sarif }
```
Custom rules (file `.checkov.yml`):
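```yaml
framework:
  - terraform
  - kubernetes
  - github_actions
skip-check:
  - CKV_GCP_999  # known issue
# `check` is an allowlist: when set, only matching checks run
# (list CKV_SINGPOST_1 here too if the custom check should run)
check:
  - CKV2_GCP_*  # all GCP graph checks
# Load custom Python checks from a local directory
external-checks-dir:
  - ./custom_checks
```
Solution custom check (Python):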
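```python
# custom_checks/ensure_oidc_used.py
# Note: the base class and module path mirror Checkov's built-in GitHub Actions
# checks; verify them against the installed Checkov version before relying on this.
from __future__ import annotations

from typing import Any

from checkov.common.models.enums import CheckResult
from checkov.github_actions.checks.base_github_action_check import BaseGithubActionsCheck
from checkov.yaml_doc.enums import BlockType


class EnsureOIDCUsed(BaseGithubActionsCheck):
    def __init__(self) -> None:
        super().__init__(
            name="Ensure GitHub Actions use OIDC (no long-lived secrets)",
            id="CKV_SINGPOST_1",
            block_type=BlockType.ARRAY,
            supported_entities=("jobs",),  # evaluate every job definition
        )

    def scan_conf(self, conf: dict[str, Any]) -> tuple[CheckResult, dict[str, Any]]:
        # A job needs the id-token permission to request an OIDC token.
        permissions = conf.get("permissions") or {}
        if "id-token" not in str(permissions):
            return CheckResult.FAILED, conf
        return CheckResult.PASSED, conf


check = EnsureOIDCUsed()
```
Subtasks 9.4-9.6: Define rules for Terraform/Terragrunt (8h), import into CI/CD (16h), import for the infra repo (16h).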
Strength of Checkov:
- 1000+ built-in rules
- Multi-framework (TF, K8s, Helm, GH Actions, Dockerfile)
- SARIF output → Security tab
- Custom rule extensible
Weakness:
- Many false positives
- Slow on large monorepos
- Some rules lag behind the latest provider versions
Purpose: DB schema migration.
Solution (already sampled in Task 1.6-1.9):
```yaml
# reusable-workflow-flyway.yml
name: Flyway Migrate
on:
  workflow_call:
    inputs:
      environment: { required: true, type: string }
      database: { required: true, type: string }
    secrets:
      DB_PASSWORD: { required: true }
      # Declared because the auth step below reads them
      WIF_PROVIDER: { required: true }
      DB_MIGRATE_SA: { required: true }
jobs:
  migrate:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}  # resolve vars per target environment
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.DB_MIGRATE_SA }}
      # Cloud SQL Auth Proxy for private DB
      - run: |
          curl -o cloud-sql-proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.7.0/cloud-sql-proxy.linux.amd64
          chmod +x cloud-sql-proxy
          ./cloud-sql-proxy ${{ vars.DB_INSTANCE_NAME }} &
          sleep 5
      - name: Flyway Migrate
        run: |
          docker run --rm --network host \
            -v "$PWD/db/migration:/flyway/sql" \
            flyway/flyway:10 \
            -url=jdbc:postgresql://localhost:5432/${{ inputs.database }} \
            -user=${{ vars.DB_USER }} \
            -password=${{ secrets.DB_PASSWORD }} \
            -baselineOnMigrate=true \
            -outOfOrder=false \
            -validateOnMigrate=true \
            migrate
      - name: Verify migration
        run: |
          docker run --rm --network host flyway/flyway:10 \
            -url=jdbc:postgresql://localhost:5432/${{ inputs.database }} \
            -user=${{ vars.DB_USER }} \
            -password=${{ secrets.DB_PASSWORD }} \
            info
```
Migration file convention:
db/migration/
├── V1__init_schema.sql
├── V2__add_users_table.sql
├── V3__add_email_to_users.sql
└── V4__create_orders_table.sql
Rules:
- `V{N}__description.sql`: versioned migration (cannot be edited after it has been applied)
- `R__view_def.sql`: repeatable migration (re-run when its checksum changes)
- Sequential, no skipped versions
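These naming rules are easy to enforce in a PR check before Flyway ever runs. Here is a minimal sketch of such a check. It encodes the team convention above (integer versions, no gaps, no duplicates), which is stricter than what Flyway itself requires. The function name and error messages are hypothetical.

```python
import re

VERSIONED = re.compile(r"^V(\d+)__\w+\.sql$")
REPEATABLE = re.compile(r"^R__\w+\.sql$")


def check_migrations(filenames: list[str]) -> list[str]:
    """Return a list of convention violations (an empty list means OK)."""
    errors: list[str] = []
    versions: list[int] = []
    for name in filenames:
        match = VERSIONED.match(name)
        if match:
            versions.append(int(match.group(1)))
        elif not REPEATABLE.match(name):
            errors.append(f"{name}: expected V{{N}}__description.sql or R__description.sql")
    duplicates = sorted({v for v in versions if versions.count(v) > 1})
    errors += [f"version {v} is used by more than one file" for v in duplicates]
    unique = sorted(set(versions))
    if unique:
        # Every integer between the lowest and highest version must be present.
        missing = sorted(set(range(unique[0], unique[-1] + 1)) - set(unique))
        errors += [f"version {v} is missing (versions must be sequential)" for v in missing]
    return errors


if __name__ == "__main__":
    print(check_migrations(["V1__init_schema.sql", "V3__add_email_to_users.sql"]))
    # ['version 2 is missing (versions must be sequential)']
```

Running it over the files in `db/migration/` on every pull request would catch a skipped or duplicated version before it reaches any database.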
Solution gotchas:
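- Network access: the GitHub Actions runner has a public IP while the DB is private, so the connection goes through the Cloud SQL Auth Proxy. The proxy handles authentication and encryption but does not create a network path by itself; for a private-IP-only instance the runner also needs connectivity into the VPC (e.g. a self-hosted runner).
- Backward compatibility: from Phase 2 onward, prod DB migrations must be backward-compatible (app v1 must still work after the migration and before app v2 is deployed).
- Lock contention: a prod migration on Friday at 5pm is a nightmare. Schedule migrations during a low-traffic window.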
Already covered in Task 1.13. Estimated at ~5 man-days for full integration (multi-stage notifications, formatted messages, error details).
What it is: Developer portal with a software catalog, API catalog, and golden path templates.
New subtasks:
- Deploy Backstage on GKE
- Configure GCP discovery plugin
- Catalog integration
- Anti-pattern lint rules (Phase 2)
Already covered in Task 6 (Container Analysis step in the Cloud Build phase). Standalone task estimate: ~10 man-days.
Already covered in Task 1.10 and Stage 4.4. Full implementation: 10 man-days.
| Person | Tasks | Status |
|---|---|---|
| Nick | GitHub Actions (Task 1) + CodeQL/SonarQube (Task 3) + Cloud Build (Task 6) | Task 1 mostly done; Task 3 SonarQube blocked |
| Lim | Dependency Review (Task 2) + Terraform/Terragrunt (Task 5) + Checkov (Task 9) | Task 2 blocked (Code Security add-on); Task 5 is heavy |
| Sean | Gitleaks/Secret Scanning (Task 4) + Container Analysis (Task 13) | Task 4 in progress |
| Unassigned | Cloud Deploy GKE (Task 7), GAR (Task 8), Flyway (Task 10), GChat (Task 11), Backstage (Task 12), Apigee (Task 14) | All New |
- GitHub Code Security add-on: blocks Task 2 (Dependency Review) and Task 3 (CodeQL on private repos)
- GitHub plan upgrade Free → Team: same as above
- SonarQube credentials (SONAR_TOKEN/HOST_URL or a self-hosted server): blocks the Task 3 SonarQube subtasks
- GCP project access + IAM permissions: blocks all GCP tasks (Task 5, 6, 7, 8, 10, 13, 14)
- WIF setup permission: blocks Task 5, 6, 7
- Cloud SQL credentials + network path: blocks Task 10 (Flyway)
- Apigee org/env/host URLs + apigee-sa permission: blocks Task 14
- GKE cluster names + regions + namespaces: blocks Task 7
- GChat webhook URLs: blocks Task 11
- Designated PROD approvers list: blocks the Task 7 approval gates
- Names of the 6 application repos (still placeholders repo-01..06)
| Task | Suggested assignee | Rationale |
|---|---|---|
| Cloud Deploy GKE (Task 7) | Nick (already has GCP experience) or Sean | Builds on the completed Task 6 work |
| GAR (Task 8) | Nick | Overlaps with Task 6 |
| Flyway (Task 10) | Lim | Lim knows DB/Terraform well |
| Google Chat (Task 11) | Sean | Quick win, 5 man-days |
| Backstage (Task 12) | Defer to Phase 2 | Needs dedicated expertise; low priority |
| Apigee (Task 14) | Sean | Sean has capacity after finishing Gitleaks |
- Cloud Build vs GH Actions Docker build: future projects may move off GCP → maintain both reusable workflows so that porting stays easy
- SonarQube self-host vs drop: setup takes 2+ weeks. CodeQL + ESLint + the CodeQL PII pack already cover it. Suggest dropping it or using a SonarCloud trial.
- Phase M vs Phase 1 priorities: Phase M (able to deploy) > Phase 1 (security gates). If the schedule slips → drop non-critical Phase 1 features (Dependency Review can be deferred because CodeQL already provides SAST).
- Single approval (GitHub Env) vs double (GH Env + Cloud Deploy `requireApproval`): friction is high. Suggest single approval for lower envs and double approval for prod only.
- Estimation buffer: "New" tasks at 0% progress are estimated by analogy. Estimates may skew by 30-50% if prerequisites are not met. Apply a 1.3x buffer to New tasks.
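The buffer rule in the last bullet is simple arithmetic, sketched below so the numbers can be checked in the meeting. The function names are hypothetical. The 5 man-day input is the Google Chat estimate quoted in this document, and the same functions work for hours.

```python
def buffered_estimate(effort: float, progress: float, buffer: float = 1.3) -> float:
    """Apply the buffer only to "New" tasks, i.e. tasks at 0% progress."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")
    if progress > 0.0:
        return effort
    return round(effort * buffer, 1)


def skew_range(effort: float, low: float = 0.30, high: float = 0.50) -> tuple[float, float]:
    """Return the effort at the low and high end of the expected 30-50% skew."""
    return round(effort * (1 + low), 1), round(effort * (1 + high), 1)


if __name__ == "__main__":
    print(buffered_estimate(5, progress=0.0))  # 6.5
    print(skew_range(5))                       # (6.5, 7.5)
```

The 1.3x buffer equals the lower bound of the 30-50% skew range, so a task that hits the upper bound would still overrun the buffered figure.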
Q: Why split into 6 repos (4 app + 1 infra + 1 cicd-components) instead of a monorepo?
A: ADR-005 decided on separation of concerns: different blast radius (app changes daily, infra weekly), different RBAC (developers do not touch infra), different velocity, and ArgoCD watches a single config repo that is kept separate from source code.
Q: Is trunk-based development too risky for 80 services?
A: No, because there are 3 safety nets: (1) required PRs + status checks, (2) feature flags for incomplete work, (3) canary deployment + auto-rollback. Gitflow with 80 services would be cherry-pick hell.
Q: WIF vs Service Account key?
A: WIF is keyless, tokens have a 15-min lifetime, and usage is auditable. The risk with a Service Account key is that a leak means permanent compromise. WIF is industry-wide best practice.
Q: Cloud Run vs GKE: why not pick just one?
A: Cloud Run for stateless APIs (BFF/DSB/BIZ): scales to 0, pay per request, cheaper. GKE for stateful workloads (Temporal, RMQ, Backstage): they need persistent connections, scheduled jobs, and complex networking.
Q: Is a 12-week phased rollout too long? Can it be accelerated?
A: Phase M takes only 2 weeks and already delivers working deployments. Phases 1-3 add safety incrementally. Rushing the full feature set into 4 weeks risks team overload, missed WIF/IAM details, and security gaps.
Q: Cost estimate?
A: GitHub Team $4/user, Code Security $49/dev, GCP Cloud Build $0.003/build-min, Cloud Deploy free, GAR $0.10/GB-month, PagerDuty $21/user. The Phase M budget is low; Phase 3 cost is high due to AI scans and soak tests.
Q: Backup/DR plan?
A: GCS bucket versioning for Terraform state, Cloud SQL backup + PITR (prod), a GAR retention policy that keeps the 10 latest versions, and GitHub itself serves as the DR copy (multi-region).
Q: Compliance with Singapore's PDPA?
A: Phase 2 includes a PII tokenization test and a PDPA erasure test (BigQuery tombstone, Cloud SQL PII Vault delete). A Backstage governance rule enforces no direct PII access.
- ADR-005: Branching Strategy
- SingPost CI/CD Architecture Guide
- SingPost CICD Project Documentation v1
- CI/CD Implementation Plan v10 (Phase M → 3)
- Infrastructure Requirements Doc
Prepared for [DevOps] Task Allocation Consolidation meeting